
arxiv: 1907.01307 · v1 · pith:MOL3TNUJnew · submitted 2019-07-02 · 💻 cs.CV

Brno Mobile OCR Dataset

Pith reviewed 2026-05-25 11:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords mobile OCR · document dataset · text recognition · computer vision · Brno Mobile OCR Dataset · handheld photography · word error rate · CTC loss

The pith

The Brno Mobile OCR Dataset supplies 19,728 annotated mobile photos of scientific papers to test document text recognition under typical phone-camera degradations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a dataset of document images captured by handheld mobile devices to support OCR research that must handle the lighting variation, blur, noise, sharpening and compression artifacts absent from scanned-page or scene-text collections. The collection covers 2,113 unique pages photographed with 23 devices, yielding nearly 20,000 images and 500,000 line-level text annotations together with a public evaluation server and hidden test set. A convolutional-recurrent baseline trained with CTC loss records 2 percent, 22 percent and 73 percent word error rates on easy, medium and hard partitions. If the captured conditions are representative, the resource will let researchers measure and improve robustness for practical mobile document reading. The work therefore centers on supplying the missing training and benchmark material rather than on a new recognition algorithm.
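The scale figures quoted above can be related to each other with a little arithmetic. The following is a minimal Python sketch; the counts are the ones given in the abstract, the variable names are illustrative, and treating "500k" as exactly 500,000 is an assumption made only for this estimate.

```python
# Counts quoted in the abstract; "500k" text lines is an approximate figure.
unique_pages = 2_113
photographs = 19_728
annotated_lines = 500_000  # assumption: "500k" taken as exactly 500,000

# Average number of photographs taken of each unique page.
photos_per_page = photographs / unique_pages

# Average number of annotated text lines per photograph.
lines_per_photo = annotated_lines / photographs

print(f"photographs per unique page: {photos_per_page:.2f}")
print(f"annotated lines per photograph: {lines_per_photo:.1f}")
```

On these numbers each page was photographed a little over nine times on average, and each photograph carries roughly 25 annotated lines.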

Core claim

The Brno Mobile OCR Dataset consists of 2,113 unique scientific paper pages photographed 19,728 times using 23 different mobile devices under varied conditions, accompanied by annotations for 500,000 text lines and an evaluation server with a hidden test set. A convolutional-recurrent baseline trained with CTC loss achieves word error rates of 2 percent, 22 percent and 73 percent on easy, medium and hard partitions.
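For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between the recognized and reference text, divided by the number of reference words. The sketch below follows that common definition; the paper's exact evaluation protocol (tokenization, normalization, aggregation) is not described on this page, so whitespace tokenization and corpus-level aggregation are assumptions, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words.

    Assumption: words are obtained by whitespace splitting.
    """
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        total += len(ref_words)
    return errors / total if total else 0.0


# One substituted word out of four reference words.
print(word_error_rate(["the quick brown fox"], ["the quick brwn fox"]))
```

Under this definition a 73 percent rate on the hard partition means that roughly three of every four reference words require an edit.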

What carries the argument

The B-MOD dataset of mobile-captured document images with line-level text annotations and difficulty partitions.

If this is right

  • Enables training and testing of OCR models specifically for mobile device images.
  • The split into easy, medium and hard parts allows graded evaluation of robustness to capture artifacts.
  • Supports additional tasks such as line localization, layout analysis, image restoration and binarization.
  • Provides a standardized benchmark through the public evaluation server and hidden test set.
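Graded evaluation of the kind described in the list above amounts to aggregating errors separately for each difficulty partition. The sketch below is a hypothetical illustration: the record layout (partition label, word errors, reference word count per line) is an assumption, not the format used by the evaluation server.

```python
from collections import defaultdict


def wer_by_partition(records):
    """Aggregate word error rate per difficulty partition.

    records: iterable of (partition, word_errors, reference_words) tuples,
    one per text line (hypothetical layout).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for partition, word_errors, reference_words in records:
        errors[partition] += word_errors
        words[partition] += reference_words
    # Skip partitions that contain no reference words to avoid division by zero.
    return {p: errors[p] / words[p] for p in words if words[p] > 0}


# Made-up per-line results, for illustration only.
example = [("easy", 0, 10), ("easy", 1, 40), ("hard", 7, 10)]
for partition, rate in wer_by_partition(example).items():
    print(f"{partition}: {rate:.0%}")
```

Summing errors and words before dividing weights long lines more heavily than short ones; averaging per-line rates instead would be a different design choice.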

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved methods on this dataset could translate directly to higher accuracy in mobile apps for scanning documents and receipts.
  • Because the source material is limited to scientific papers, performance gains may not transfer equally to forms, books or handwritten text.
  • Pairing the dataset with dedicated restoration networks offers a testable route to lowering the reported 73 percent error on hard cases.

Load-bearing premise

The 23 devices and capture conditions produce a representative sample of the non-uniform lighting, blur, noise, sharpening and compression artifacts that occur in typical handheld mobile document photography.

What would settle it

An OCR method that achieves a word error rate below 10 percent on the hard subset of the non-public test set would indicate that the baseline's 73 percent figure reflects the limits of the baseline recognizer rather than difficulty inherent in the mobile capture artifacts.

Figures

Figures reproduced from arXiv: 1907.01307 by Martin Kišš, Michal Hradiš, Oldřich Kodym.

Figure 1: Dataset creation. Random pages are augmented by lo…
Figure 2: Histogram of used devices.
Figure 3: Representative examples of photographs in the B-MOD…
Figure 4: Histogram of the number of photographs taken per…
Figure 5: Overview of the baseline detection pipeline.
Figure 6: Samples of lines. Column (a) shows easy lines, (b) shows medium lines and hard lines are in (c).
Figure 7: Lines extracted per template.
Figure 8: Histogram of the number of characters in extracted lines (x-axis: number of characters in line; y-axis: number of lines).
Original abstract

We introduce the Brno Mobile OCR Dataset (B-MOD) for document Optical Character Recognition from low-quality images captured by handheld mobile devices. While OCR of high-quality scanned documents is a mature field where many commercial tools are available, and large datasets of text in the wild exist, no existing datasets can be used to develop and test document OCR methods robust to non-uniform lighting, image blur, strong noise, built-in denoising, sharpening, compression and other artifacts present in many photographs from mobile devices. This dataset contains 2 113 unique pages from random scientific papers, which were photographed by multiple people using 23 different mobile devices. The resulting 19 728 photographs of various visual quality are accompanied by precise positions and text annotations of 500k text lines. We further provide an evaluation methodology, including an evaluation server and a test set with non-public annotations. We provide a state-of-the-art text recognition baseline built on convolutional and recurrent neural networks trained with Connectionist Temporal Classification loss. This baseline achieves 2 %, 22 % and 73 % word error rates on easy, medium and hard parts of the dataset, respectively, confirming that the dataset is challenging. The presented dataset will enable future development and evaluation of document analysis for low-quality images. It is primarily intended for line-level text recognition, and can be further used for line localization, layout analysis, image restoration and text binarization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the Brno Mobile OCR Dataset (B-MOD), a collection of 19,728 photographs of 2,113 unique scientific paper pages captured with 23 mobile devices under varied conditions. It supplies precise line-level text and position annotations for approximately 500k lines, an evaluation server with a non-public test set, and a CNN+RNN+CTC baseline reporting word error rates of 2%, 22%, and 73% on easy, medium, and hard partitions, respectively. The work positions the resource as filling a gap for document OCR under real mobile capture artifacts (non-uniform lighting, blur, noise, sharpening, compression) not addressed by existing scanned-document or scene-text corpora.
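A recognizer trained with CTC loss emits one score vector per image column (frame), and a transcription is usually read off by best-path decoding: take the top class in each frame, merge consecutive repeats, then drop the blank symbol. The sketch below illustrates that standard decoding step only; it is not the authors' implementation, and the toy alphabet, the scores, and the choice of index 0 for the blank are assumptions.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_scores: list of per-frame score lists, one score per class.
    alphabet: list of symbols indexed by class id; the entry at `blank`
              is a placeholder and never emitted.
    """
    # 1. Pick the highest-scoring class in every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Merge runs of identical consecutive classes.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Remove blanks; a blank between two equal labels keeps both of them.
    return "".join(alphabet[label] for label in collapsed if label != blank)


# Toy example: classes are blank, "a", "b".
alphabet = ["-", "a", "b"]
scores = [
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.8, 0.1],    # a (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.1, 0.8],    # b
]
print(ctc_greedy_decode(scores, alphabet))
```

Beam search with a language model is a common alternative to greedy decoding; this page does not say which decoding the baseline uses.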

Significance. If the annotations prove accurate and the device/condition sampling representative, the dataset supplies a needed benchmark for line-level recognition, localization, and restoration methods under realistic mobile conditions. The hidden test set and public evaluation server are concrete strengths that support reproducible comparison. The reported baseline WER progression directly illustrates the dataset's intended difficulty gradient.

minor comments (2)
  1. [Abstract] The criteria used to define the easy/medium/hard partitions (and how they relate to the 23 devices or capture conditions) are not stated; this partitioning detail should be made explicit in §3 or §4 so that the 2/22/73 % WER numbers can be interpreted without ambiguity.
  2. [Dataset description] The manuscript states that 23 devices were used but does not tabulate per-device image counts or artifact statistics; adding a small table or histogram in the dataset description section would strengthen the claim of diversity.
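The per-device tabulation requested in the second comment could be produced from image metadata along the following lines. This is a hypothetical sketch: the (image id, device name) record layout and the device names are invented for illustration and do not reflect the dataset's actual metadata format.

```python
from collections import Counter


def images_per_device(records):
    """Count images per capture device.

    records: iterable of (image_id, device_name) pairs (hypothetical layout).
    """
    return Counter(device for _, device in records)


def format_table(counts):
    """Render the counts as a plain-text table, most used device first."""
    lines = [f"{'device':<12}{'images':>8}"]
    for device, n in counts.most_common():
        lines.append(f"{device:<12}{n:>8}")
    return "\n".join(lines)


# Made-up records, for illustration only.
records = [("a.jpg", "phone_A"), ("b.jpg", "phone_B"), ("c.jpg", "phone_A")]
print(format_table(images_per_device(records)))
```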

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept the manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a dataset introduction paper with no mathematical derivation chain, first-principles predictions, or fitted parameters. The central contribution is the B-MOD collection of 19,728 images with 500k line annotations plus an evaluation server; the CNN+RNN+CTC baseline is presented as a standard off-the-shelf recognizer whose reported 2/22/73 % WER figures simply document dataset difficulty rather than claiming any novel prediction that could reduce to its own inputs. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs. The work is therefore self-contained against external benchmarks, giving a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper; the central contribution is the collection and annotation of images rather than any derivation that would require free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5788 in / 976 out tokens · 33510 ms · 2026-05-25T11:05:46.591978+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 2 internal anchors

  1. [1]

    License plate recognition and super-resolution from low-resolution videos by convolutional neural networks,

    V. Vašek, V. Franc, and M. Urban, “License plate recognition and super-resolution from low-resolution videos by convolutional neural networks,” in Proc. of British Machine Vision Conference, September

  2. [2]

    Available: ftp://cmp.felk.cvut.cz/pub/cmp/articles/franc/Vasek-LPR-BMVC2018.pdf

    [Online]. Available: ftp://cmp.felk.cvut.cz/pub/cmp/articles/franc/Vasek-LPR-BMVC2018.pdf

  3. [3]

    Icdar2015 competition on smartphone document capture and ocr (smartdoc),

    J.-C. Burie, J. Chazalon, M. Coustaty, S. Eskenazi, M. M. Luqman, M. Mehri, N. Nayef, J.-M. Ogier, S. Prum, and M. Rusiñol, “Icdar2015 competition on smartphone document capture and ocr (smartdoc),” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1161–1165

  4. [4]

    Smartdoc-qa: A dataset for quality assessment of smartphone captured document images-single and multiple distortions,

    N. Nayef, M. M. Luqman, S. Prum, S. Eskenazi, J. Chazalon, and J.-M. Ogier, “Smartdoc-qa: A dataset for quality assessment of smartphone captured document images-single and multiple distortions,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1231–1235

  5. [5]

    Smartatid: A mobile captured arabic text images dataset for multi-purpose recognition tasks,

    F. Chabchoub, Y. Kessentini, S. Kanoun, V. Eglin, and F. Lebourgeois, “Smartatid: A mobile captured arabic text images dataset for multi-purpose recognition tasks,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 120–125

  6. [6]

    The iam-database: an english sentence database for offline handwriting recognition,

    U.-V. Marti and H. Bunke, “The iam-database: an english sentence database for offline handwriting recognition,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, Nov

  7. [7]

    Available: https://doi.org/10.1007/s100320200071

    [Online]. Available: https://doi.org/10.1007/s100320200071

  8. [8]

    RIMES evaluation campaign for handwritten mail processing,

    E. Augustin, J.-M. Brodin, M. Carré, E. Geoffrois, E. Grosicki, and F. Prêteux, “RIMES evaluation campaign for handwritten mail processing,” in Proc. of the Workshop on Frontiers in Handwriting Recognition, no. 1, 2006

  9. [9]

    Icdar2017 competition on handwritten text recognition on the read dataset,

    J. A. Sánchez, V. Romero, A. H. Toselli, M. Villegas, and E. Vidal, “Icdar2017 competition on handwritten text recognition on the read dataset,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 1383–1388

  10. [10]

    The impact dataset of historical document images,

    C. Papadopoulos, S. Pletschacher, C. Clausner, and A. Antonacopoulos, “The impact dataset of historical document images,” in Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing. ACM, 2013, pp. 123–130

  11. [11]

    Icdar2017 competition on recognition of early indian printed documents-reid2017,

    C. Clausner, A. Antonacopoulos, T. Derrick, and S. Pletschacher, “Icdar2017 competition on recognition of early indian printed documents-reid2017,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 1411–1416

  12. [12]

    The enp image and ground truth dataset of historical newspapers,

    C. Clausner, C. Papadopoulos, S. Pletschacher, and A. Antonacopoulos, “The enp image and ground truth dataset of historical newspapers,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 931–935

  13. [13]

    Icdar 2015 competition on robust reading,

    D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al., “Icdar 2015 competition on robust reading,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1156–1160

  14. [14]

    Icdar2017 robust reading challenge on coco-text,

    R. Gomez, B. Shi, L. Gomez, L. Neumann, A. Veit, J. Matas, S. Belongie, and D. Karatzas, “Icdar2017 robust reading challenge on coco-text,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, pp. 1435–1443

  15. [15]

    Downtown osaka scene text dataset,

    M. Iwamura, T. Matsuda, N. Morimoto, H. Sato, Y. Ikeda, and K. Kise, “Downtown osaka scene text dataset,” in European Conference on Computer Vision. Springer, 2016, pp. 440–455

  16. [16]

    cBAD: ICDAR2017 competition on baseline detection,

    M. Diem, F. Kleber, S. Fiel, T. Gruning, and B. Gatos, “cBAD: ICDAR2017 competition on baseline detection,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, nov 2017. [Online]. Available: https://doi.org/10.1109/icdar.2017.222

  17. [17]

    The page (page analysis and ground-truth elements) format framework,

    S. Pletschacher and A. Antonacopoulos, “The page (page analysis and ground-truth elements) format framework,” in 2010 20th International Conference on Pattern Recognition. IEEE, 2010, pp. 257–260

  18. [18]

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376

  19. [19]

    Convolve, attend and spell: An attention-based sequence-to-sequence model for handwritten word recognition,

    L. Kang, J. I. Toledo, P. Riba, M. Villegas, A. Fornés, and M. Rusinol, “Convolve, attend and spell: An attention-based sequence-to-sequence model for handwritten word recognition,” in German Conference on Pattern Recognition. Springer, 2018, pp. 459–472

  20. [20]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available: ...

  21. [21]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014

  22. [22]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014