Brno Mobile OCR Dataset
Pith reviewed 2026-05-25 11:05 UTC · model grok-4.3
The pith
The Brno Mobile OCR Dataset supplies 19,728 annotated mobile photos of scientific papers to test document text recognition under typical phone-camera degradations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Brno Mobile OCR Dataset consists of 2,113 unique scientific paper pages photographed 19,728 times using 23 different mobile devices under varied conditions, accompanied by annotations for 500,000 text lines and an evaluation server with a hidden test set. A convolutional-recurrent baseline trained with CTC loss achieves word error rates of 2 percent, 22 percent and 73 percent on easy, medium and hard partitions.
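For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate: the word-level edit distance between a reference line and a recognized line, divided by the number of reference words. The whitespace tokenization and the function name `word_error_rate` are assumptions made for illustration; the exact scoring rules of the B-MOD evaluation server are not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumption: words are separated by whitespace; the dataset's official
    tokenization may differ.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra hypothesis word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("quack") and one deletion ("fox") over four reference words.
example_wer = word_error_rate("the quick brown fox", "the quack brown")
```

Because insertions count as errors, this ratio can exceed 1.0 when the recognizer emits many spurious words.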
What carries the argument
The B-MOD dataset of mobile-captured document images with line-level text annotations and difficulty partitions.
If this is right
- Enables training and testing of OCR models specifically for mobile device images.
- The split into easy, medium and hard parts allows graded evaluation of robustness to capture artifacts.
- Supports additional tasks such as line localization, layout analysis, image restoration and binarization.
- Provides a standardized benchmark through the public evaluation server and hidden test set.
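A graded evaluation of this kind can be sketched as a per-partition aggregation of line-level error counts. The example assumes each line has already been scored as a pair (word errors, reference words) and pools the counts within each partition before dividing. The pooling strategy, the function name `wer_by_partition`, and the toy numbers are illustrative assumptions, not figures or procedures taken from the paper.

```python
from collections import defaultdict


def wer_by_partition(line_results):
    """Pool word errors and reference words per difficulty partition.

    line_results: iterable of (partition, word_errors, reference_words).
    Returns a dict mapping each partition to its pooled word error rate.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for partition, word_errors, reference_words in line_results:
        errors[partition] += word_errors
        words[partition] += reference_words
    # Partitions with no reference words are skipped to avoid dividing by zero.
    return {p: errors[p] / words[p] for p in words if words[p] > 0}


# Toy line scores, invented purely to exercise the function.
toy_results = [
    ("easy", 0, 10),
    ("easy", 1, 40),
    ("medium", 5, 20),
    ("hard", 15, 20),
]
per_partition = wer_by_partition(toy_results)
```

Pooling counts weights long lines more heavily than averaging per-line rates would; which of the two the evaluation server uses is not stated here.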
Where Pith is reading between the lines
- Improved methods on this dataset could translate directly to higher accuracy in mobile apps for scanning documents and receipts.
- Because the source material is limited to scientific papers, performance gains may not transfer equally to forms, books or handwritten text.
- Pairing the dataset with dedicated restoration networks offers a testable route to lowering the reported 73 percent error on hard cases.
Load-bearing premise
The 23 devices and capture conditions produce a representative sample of the non-uniform lighting, blur, noise, sharpening and compression artifacts that occur in typical handheld mobile document photography.
What would settle it
An OCR method that achieves word error rates below 10 percent on the hard subset of the non-public test set would indicate that the baseline's 73 percent figure reflects the limits of that model rather than the intrinsic difficulty introduced by mobile artifacts.
Original abstract
We introduce the Brno Mobile OCR Dataset (B-MOD) for document Optical Character Recognition from low-quality images captured by handheld mobile devices. While OCR of high-quality scanned documents is a mature field where many commercial tools are available, and large datasets of text in the wild exist, no existing datasets can be used to develop and test document OCR methods robust to non-uniform lighting, image blur, strong noise, built-in denoising, sharpening, compression and other artifacts present in many photographs from mobile devices. This dataset contains 2 113 unique pages from random scientific papers, which were photographed by multiple people using 23 different mobile devices. The resulting 19 728 photographs of various visual quality are accompanied by precise positions and text annotations of 500k text lines. We further provide an evaluation methodology, including an evaluation server and a testset with non-public annotations. We provide a state-of-the-art text recognition baseline built on convolutional and recurrent neural networks trained with Connectionist Temporal Classification loss. This baseline achieves 2 %, 22 % and 73 % word error rates on easy, medium and hard parts of the dataset, respectively, confirming that the dataset is challenging. The presented dataset will enable future development and evaluation of document analysis for low-quality images. It is primarily intended for line-level text recognition, and can be further used for line localization, layout analysis, image restoration and text binarization.
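A network trained with Connectionist Temporal Classification emits one score vector per horizontal frame of the line image, including a special blank class. The sketch below shows greedy (best-path) decoding, a common way to turn those frame scores into text: take the top class per frame, merge consecutive repeats, then drop blanks. The abstract does not say which decoder the baseline uses, so greedy decoding, the function name `ctc_greedy_decode`, and the three-symbol label set are assumptions for illustration.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best-path CTC decoding.

    frame_scores: list of per-frame score lists, one score per class.
    labels: sequence mapping class index to character; labels[blank] is unused.
    """
    # Highest-scoring class index in every frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]

    chars = []
    previous = blank
    for index in best_path:
        # Emit a character only when the class changes and is not the blank.
        if index != blank and index != previous:
            chars.append(labels[index])
        previous = index
    return "".join(chars)


# Toy scores over three classes: blank, "a", "b".
toy_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
]
decoded = ctc_greedy_decode(toy_scores, labels="-ab")
```

The blank frame between the two runs of "a" is what allows a doubled letter to survive the merge step.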
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Brno Mobile OCR Dataset (B-MOD), a collection of 19,728 photographs of 2,113 unique scientific paper pages captured with 23 mobile devices under varied conditions. It supplies precise line-level text and position annotations for approximately 500k lines, an evaluation server with a non-public test set, and a CNN+RNN+CTC baseline reporting word error rates of 2%, 22%, and 73% on easy, medium, and hard partitions, respectively. The work positions the resource as filling a gap for document OCR under real mobile capture artifacts (non-uniform lighting, blur, noise, sharpening, compression) not addressed by existing scanned-document or scene-text corpora.
Significance. If the annotations prove accurate and the device/condition sampling representative, the dataset supplies a needed benchmark for line-level recognition, localization, and restoration methods under realistic mobile conditions. The hidden test set and public evaluation server are concrete strengths that support reproducible comparison. The reported baseline WER progression directly illustrates the dataset's intended difficulty gradient.
minor comments (2)
- [Abstract] The criteria used to define the easy/medium/hard partitions (and how they relate to the 23 devices or capture conditions) are not stated; this partitioning detail should be made explicit in §3 or §4 so that the 2/22/73 % WER numbers can be interpreted without ambiguity.
- [Dataset description] The manuscript states that 23 devices were used but does not tabulate per-device image counts or artifact statistics; adding a small table or histogram in the dataset description section would strengthen the claim of diversity.
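The per-device table requested in the second comment could be produced with a few lines of code once each image carries a device label. The sketch below assumes a hypothetical list of records with a `device` field; the field name, the device names, and the counts are invented for illustration and do not come from the dataset's actual metadata.

```python
from collections import Counter


def images_per_device(records):
    """Count images per capture device from records with a 'device' key."""
    return Counter(record["device"] for record in records)


def format_table(counts):
    """Render the counts as a plain-text table, most frequent device first."""
    width = max([len("device")] + [len(device) for device in counts])
    lines = [f"{'device'.ljust(width)}  images"]
    for device, n in counts.most_common():
        lines.append(f"{device.ljust(width)}  {n:>6}")
    return "\n".join(lines)


# Hypothetical metadata records; the real annotation schema may differ.
toy_records = [
    {"device": "phone_a", "page": 1},
    {"device": "phone_b", "page": 1},
    {"device": "phone_a", "page": 2},
]
toy_counts = images_per_device(toy_records)
toy_table = format_table(toy_counts)
```

The same grouping could be extended with per-device blur or noise statistics if such measurements were available.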
Simulated Author's Rebuttal
We thank the referee for the positive review and the recommendation to accept the manuscript.
Circularity Check
No significant circularity
This is a dataset introduction paper with no mathematical derivation chain, first-principles predictions, or fitted parameters. The central contribution is the B-MOD collection of 19,728 images with 500k line annotations plus an evaluation server; the CNN+RNN+CTC baseline is presented as a standard off-the-shelf recognizer whose reported 2/22/73 % WER figures simply document dataset difficulty rather than claiming any novel prediction that could reduce to its own inputs. There is no load-bearing self-citation, ansatz smuggling, or renaming of known results. The work is therefore self-contained against external benchmarks, with a circularity score of 0.