ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019

Cheng-Lin Liu; Dimosthenis Karatzas; Jean-Christophe Burie; Jean-Marc Ogier; Jiri Matas; Michal Busta; Nibal Nayef; Pinaki Nath Chowdhury; Umapada Pal; Wafa Khlif

arxiv: 1907.00945 · v1 · pith:KSBJSSBTnew · submitted 2019-07-01 · 💻 cs.CV

Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-Lin Liu, Jean-Marc Ogier

Pith reviewed 2026-05-25 11:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-lingual scene text · text detection · script classification · end-to-end recognition · synthetic dataset · ICDAR competition · benchmark evaluation

The pith

A 2019 challenge benchmarks multi-lingual scene text detection and recognition using 20,000 images across 10 languages plus synthetic data and four tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports the setup and results of the RRC-MLT-2019 competition, which expands an earlier version by adding an end-to-end recognition task, one more language, and a large synthetic training set. The real dataset contains 20,000 images with scene text from 10 languages. Four tasks are defined to measure progress on detection, script classification of cropped words, joint detection plus classification, and complete end-to-end detection and recognition. The competition drew 60 submissions from research and industry groups. The paper presents the dataset construction, task definitions, and the observed performance levels to serve as a public benchmark.

Core claim

The RRC-MLT-2019 challenge supplies a 20,000-image real dataset covering text in 10 languages together with a large multi-lingual synthetic set and defines four tasks—text detection, cropped-word script classification, joint detection and classification, and end-to-end recognition—to enable systematic comparison of methods and to drive advances in multi-lingual scene text processing.

What carries the argument

The four-task evaluation protocol built on the 20,000-image real dataset and the accompanying synthetic training data.
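
As a rough illustration of how such a protocol might score a submission, the sketch below computes precision, recall and F-measure for one image by greedily matching predicted boxes to ground-truth boxes. Several things here are assumptions rather than details taken from the paper: the 0.5 IoU threshold, the use of axis-aligned rectangles instead of quadrilaterals, the greedy matching order, and all names (`Box`, `iou`, `score`). The `match_label` flag is a hypothetical way to turn the same scorer into a joint detection-plus-script or end-to-end check, where a match also needs the right script name or transcription.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; `label` holds a script name or transcription if needed."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: Optional[str] = None


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def score(preds: List[Box], gts: List[Box], iou_thresh: float = 0.5,
          match_label: bool = False) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) for one image.

    Each prediction is matched to the best still-unmatched ground-truth box
    whose IoU reaches `iou_thresh` (assumed value). With `match_label=True`
    the labels must also be equal for the pair to count.
    """
    unmatched = list(gts)
    true_positives = 0
    for p in preds:
        best, best_iou = None, iou_thresh
        for g in unmatched:
            overlap = iou(p, g)
            if overlap >= best_iou and (not match_label or p.label == g.label):
                best, best_iou = g, overlap
        if best is not None:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(gts) if gts else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Toy example with made-up boxes: two ground-truth words, three predictions.
gts = [Box(0, 0, 10, 10, "Latin"), Box(20, 20, 30, 30, "Arabic")]
preds = [Box(0, 0, 10, 10, "Latin"), Box(20, 20, 30, 30, "Latin"),
         Box(50, 50, 60, 60, "Latin")]
detection_only = score(preds, gts)
with_script = score(preds, gts, match_label=True)
```

On this toy input the second prediction counts as a correct detection but fails once the script label is also required, which is the kind of gap between tasks that a shared dataset makes visible.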

If this is right

Detection and recognition pipelines can now be scored on identical multi-lingual data.
Script classification can be tested both in isolation and when coupled with detection.
Synthetic data can be used to supplement limited real training examples for end-to-end systems.
Performance numbers from the 60 submissions establish current reference levels for each task.
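
One way the synthetic set could supplement real training examples is to fix the share of real samples in every training batch. The sketch below is only an illustration of that idea: the function name `mixed_batches`, the `real_fraction` parameter, and sampling with replacement are all assumptions, not a procedure described in the paper.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batches(real: Sequence[T], synthetic: Sequence[T], batch_size: int,
                  real_fraction: float, seed: int = 0) -> Iterator[List[T]]:
    """Yield endless batches with a fixed share of real samples.

    Samples are drawn with replacement, so a small real set can be reused
    across many batches while the synthetic set fills the remainder.
    """
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_synthetic = batch_size - n_real
    while True:
        batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_synthetic)
        rng.shuffle(batch)  # avoid a fixed real-then-synthetic ordering
        yield batch


# Hypothetical file names standing in for real and synthetic images.
real_images = ["real_0001.jpg", "real_0002.jpg"]
synthetic_images = ["syn_0001.jpg", "syn_0002.jpg", "syn_0003.jpg"]
first_batch = next(mixed_batches(real_images, synthetic_images,
                                 batch_size=8, real_fraction=0.25))
```

The ratio is a tuning choice; a schedule that raises `real_fraction` over the course of training would be an equally plausible variant.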

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The joint detection-plus-classification task may encourage architectures that handle script identification without separate stages.
Results on the end-to-end task could reveal whether current pipelines remain language-specific or are becoming more universal.
The dataset size and language coverage set a scale that later competitions might need to exceed to stay relevant.

Load-bearing premise

The chosen 20,000 real images and the added synthetic images together capture enough variety of multi-lingual scene text to serve as a lasting benchmark.

What would settle it

A method that ranks high on all four tasks yet shows markedly lower accuracy on a fresh collection of scene images containing the same ten languages but collected under different conditions would indicate the benchmark does not generalize.

Original abstract

With the growing cosmopolitan culture of modern cities, the need of robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been more immense. With the goal to systematically benchmark and push the state-of-the-art forward, the proposed competition builds on top of the RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large scale multi-lingual synthetic dataset to assist the training, and a baseline End-to-End recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the presented RRC-MLT-2019 challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a standard competition report that releases a 20k-image multi-lingual dataset and defines four tasks, with no new methods or analysis.

This paper is a report on the RRC-MLT-2019 challenge. It extends the 2017 version by adding an end-to-end detection and recognition task, one more language to the real images, a large synthetic training set, and a baseline end-to-end method. The real dataset has 20,000 images across 10 languages, and the challenge drew 60 submissions. The four tasks are text detection, cropped word script classification, joint detection and classification, and end-to-end recognition.

The paper lays out the dataset, tasks, and participation numbers in straightforward terms. It does a reasonable job documenting the resources and the scale of interest from both research and industry groups. The factual details on sizes and counts line up with the abstract and show no internal contradictions. There are no derivations, models, or quantitative claims that need extra assumptions to hold, so the circularity burden stays at zero. The main limitation is that it stays descriptive; it does not test whether the added synthetic data or the new language mix actually improves methods or closes gaps in multi-lingual performance. The representativeness point is mentioned but not examined in depth, which is typical for this kind of report and not a load-bearing flaw here.

Researchers working on scene text detection and recognition, especially those needing multi-lingual benchmarks, will find the dataset and task definitions useful. A methods-focused reading group would probably skip it, but a group that tracks datasets and challenges might want to see the resources. I would not cite the paper itself in new work. It deserves peer review because the dataset release and task definitions can serve as a reference point for the subfield even without novel technical results.

Referee Report

0 major / 1 minor

Summary. The manuscript is the report for the ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition (RRC-MLT-2019). It describes the construction of a dataset consisting of 20,000 real images with text in 10 languages, augmented by a large-scale synthetic multi-lingual dataset for training. The challenge defines four tasks: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification, and (d) end-to-end detection and recognition. A baseline end-to-end method is provided, and the paper summarizes the 60 submissions received from research and industrial communities, presenting the dataset, tasks, and findings.

Significance. This work is significant as it extends prior challenges with additional languages, an end-to-end task, and synthetic data to address data scarcity in multi-lingual scene text research. By documenting 60 submissions, it provides insight into current state-of-the-art approaches and serves as a reference point for future work in the field. The public release of the dataset and protocols, if executed as described, will enable reproducible benchmarking.

minor comments (1)

[Abstract] The abstract mentions 'findings of the presented RRC-MLT-2019 challenge' but does not say which metrics or top results those findings cover; a brief summary of the top performances would make the abstract more informative.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough review and positive recommendation to accept the manuscript. The report accurately summarizes the RRC-MLT-2019 challenge, its dataset, tasks, and outcomes.

Circularity Check

0 steps flagged

No significant circularity: purely descriptive competition report

The paper is a standard competition report. It describes the release of a 20k-image multi-lingual dataset, definition of four tasks, addition of a synthetic training set, a baseline method, and summary of 60 submissions. No equations, derivations, predictions, fitted parameters, or first-principles claims appear anywhere in the document. The central content is factual reporting of challenge organization and participation; the stated goal of benchmarking does not rely on any self-referential reduction or load-bearing self-citation chain. All content is externally verifiable via the released dataset and public submissions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a competition report with no mathematical content, derivations, or modeling; therefore it introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5771 in / 1174 out tokens · 39013 ms · 2026-05-25T11:48:53.476345+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, W. Khlif, M. M. Luqman, J.-C. Burie, C.-L. Liu, and J.-M. Ogier, “ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT,” in ICDAR, 2017
[2] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, “COCO-Text: Dataset and benchmark for text detection and recognition in natural images,” arXiv preprint arXiv:1601.07140, 2016
[3] D. Karatzas, L. G. i Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, “ICDAR 2015 competition on robust reading,” in ICDAR, 2015
[4] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazán, and L. P. de las Heras, “ICDAR 2013 robust reading competition,” in ICDAR, 2013
[5] L. G. i Bigorda, A. Nicolaou, and D. Karatzas, “Improving patch-based scene text script identification with ensembles of conjoined networks,” Pattern Recognition, 2017
[6] M. Jain, M. Mathew, and C. Jawahar, “Unconstrained scene text and video text recognition for Arabic script,” in ASAR, 2017
[7] R. Smith, C. Gu, D.-S. Lee, H. Hu, R. Unnikrishnan, J. Ibarz, S. Arnoud, and S. Lin, “End-to-end interpretation of the French street name signs dataset,” in Computer Vision – ECCV 2016 Workshops, 2016, pp. 411–426
[8] M. Mathew, M. Jain, and C. V. Jawahar, “Benchmarking scene text recognition in Devanagari, Telugu and Malayalam,” in ICDAR-MOCR Workshop, 2017
[9] M. T. M. N. S. H. I. Y. K. K. Iwamura, M., “Downtown Osaka scene text dataset,” in ECCV IWRR Workshop, 2016
[10] H. Mengchao and Y. Zhibo. (2018) ICPR MTWI multi-type web images. [Online]. Available: https://tianchi.aliyun.com/competition/entrance/231651/introduction
[11] B. Shi, X. Bai, and C. Yao, “Script identification in the wild via discriminative convolutional neural network,” Pattern Recognition, 2016
[12] N. Sharma, R. Mandal, R. Sharma, U. Pal, and M. Blumenstein, “ICDAR2015 competition on video script identification (CVSI 2015),” in ICDAR, 2015
[13] A. K. Singh, A. Mishra, P. Dabral, and C. V. Jawahar, “A simple and effective solution for script identification in the wild,” in DAS, 2016
[14] L. G. i Bigorda and D. Karatzas, “A fine-grained approach to scene text script identification,” in DAS, 2016
[15] M. Bušta, Y. Patel, and J. Matas, “E2E-MLT – an unconstrained end-to-end method for multi-language scene text,” ACCV IWRR Workshop, 2018
[16] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in CVPR, 2016
[17] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” PAMI, 2010
[18] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in CVPR, 2014
[19] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang, “Semantic image segmentation via deep parsing network,” in ICCV, 2015
[20] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, 1981
[21] J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu, “Pyramid mask text detector,” arXiv preprint arXiv:1903.11800, 2019
[22] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection,” in CVPR, 2019
[23] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, “What is wrong with scene text recognition model comparisons? Dataset and model analysis,” arXiv preprint arXiv:1904.01906, 2019
