
arxiv: 1906.10907 · v1 · pith:PFURVECBnew · submitted 2019-06-26 · 💻 cs.CL

Leveraging Text Repetitions and Denoising Autoencoders in OCR Post-correction

Pith reviewed 2026-05-25 16:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords OCR post-correction · seq2seq model · synthetic training data · text repetitions · denoising autoencoder · historical newspapers · error distribution · Finnish text

The pith

Repeating spans in raw OCR corpora yield error distributions that train a seq2seq model outperforming uniform-noise baselines on historical Finnish newspapers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that manually corrected training pairs are not required for OCR post-correction because repeating text spans inside large raw OCR collections can be mined to estimate the actual distribution of recognition errors. Those estimated errors are applied to clean text to produce synthetic training examples, which are then used to train a character-level sequence-to-sequence model that functions as a denoising autoencoder. When this model is tested on a manually corrected corpus of mostly 19th-century Finnish newspapers, it improves both over the original OCR output and over earlier models trained with uniformly generated artificial noise. A sympathetic reader would care because the method removes the main cost barrier to scaling post-correction across very large historical collections.

Core claim

A character-level seq2seq model trained on synthetic data whose error distribution is estimated from repeating text spans outperforms both the baseline OCR and models trained with uniformly generated noise on a manually corrected Finnish newspaper corpus.

What carries the argument

Mining repeating text spans inside a large raw OCR corpus to estimate the real error distribution and then generate aligned synthetic training pairs for a denoising autoencoder.
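The repeat-mining idea can be illustrated with a minimal sketch. It assumes the spans in a cluster are already aligned to equal length (a real pipeline would need an alignment step and a gap symbol for insertions and deletions), and it takes the most common character at each position as the presumed correct one, as the Figure 1 caption describes. The function names and the toy cluster are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def consensus(cluster):
    """Majority character per position; assumes equal-length, pre-aligned spans."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


def confusion_counts(cluster):
    """Count how often each consensus character was read as each observed character."""
    ref = consensus(cluster)
    counts = defaultdict(Counter)
    for span in cluster:
        for ref_ch, obs_ch in zip(ref, span):
            counts[ref_ch][obs_ch] += 1
    return counts


# Toy cluster (invented for illustration): four OCR readings of the same word.
cluster = ["sanomalehti", "sanomalehti", "sanomalehtl", "sauomalehti"]
counts = confusion_counts(cluster)
print(consensus(cluster))   # consensus string for the cluster
print(dict(counts["n"]))    # observed readings of the consensus character "n"
```

Normalizing each row of `counts` gives a per-character confusion distribution that can later be sampled to corrupt clean text.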

If this is right

  • The approach eliminates the need to produce manually corrected training data for the post-correction model.
  • The resulting model improves token and character accuracy over the underlying OCR system on the target corpus.
  • Synthetic data built from observed repeats outperforms synthetic data built from uniform random noise.
  • The method is demonstrated on 19th-century Finnish newspaper text but relies only on the existence of repeated spans in the raw corpus.
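The contrast between repeat-derived noise and uniform noise can be sketched as follows. This is a simplified illustration that handles substitutions only; the confusion table, the noise rate, and the function names are assumptions made for the example, not values or code from the paper.

```python
import random


def corrupt(text, confusions, rng):
    """Sample each character from its estimated confusion distribution."""
    out = []
    for ch in text:
        dist = confusions.get(ch)
        if not dist:
            out.append(ch)  # no observed confusions: keep the character
            continue
        chars, weights = zip(*dist.items())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)


def corrupt_uniform(text, alphabet, rate, rng):
    """Baseline: replace each character with a uniformly random one at a fixed rate."""
    return "".join(rng.choice(alphabet) if rng.random() < rate else ch for ch in text)


# Hypothetical confusion probabilities, invented for illustration.
confusions = {"n": {"n": 0.9, "u": 0.1}, "i": {"i": 0.95, "l": 0.05}}
rng = random.Random(0)
clean = "sanomalehti"
noisy = corrupt(clean, confusions, rng)
training_pair = (noisy, clean)  # (model input, model target)
```

Under the repeat-derived scheme the model only sees confusions that were actually observed, whereas the uniform baseline spreads errors evenly over the alphabet.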

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same repeat-mining step could be applied to other languages or document genres that contain frequent repeated phrases.
  • If repeats are scarce in a given corpus, the error estimates may become unreliable and performance may fall back toward uniform-noise levels.
  • Iterating the process by using the model’s own corrections to find new repeats could further refine the error distribution.
  • The technique might be combined with language-model rescoring to handle cases where the seq2seq model alone leaves residual errors.

Load-bearing premise

The error distribution observed in repeating spans within the large raw OCR corpus is representative of the error distribution present in the target evaluation corpus of 19th-century Finnish newspapers.

What would settle it

If the repeat-derived model shows no accuracy gain over a uniform-noise model when both are evaluated on the manually corrected test set, the central claim would be falsified.
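One way to make "accuracy gain" concrete is character error rate (CER), the edit distance between a system output and the gold text divided by the gold length. The sketch below is a standard Levenshtein implementation; whether the paper reports exactly this metric is an assumption here, and the example strings are invented.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost substitution, insertion, and deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of a hypothesis against a gold reference."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)


# Invented example: one substituted character in an 11-character word.
print(cer("sauomalehti", "sanomalehti"))
```

Computing this for the raw OCR output, the uniform-noise model, and the repeat-derived model on the same gold lines gives the three numbers the falsification test compares.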

Figures

Figures reproduced from arXiv: 1906.10907 by Aleksi Vesanto, Filip Ginter, Kai Hakala, Niko Miekka, Tapio Salakoski.

Figure 1: A sample from a single cluster. The most common character for each position is chosen as the represen…
Original abstract

A common approach for improving OCR quality is a post-processing step based on models correcting misdetected characters and tokens. These models are typically trained on aligned pairs of OCR read text and their manually corrected counterparts. In this paper we show that the requirement of manually corrected training data can be alleviated by estimating the OCR errors from repeating text spans found in large OCR read text corpora and generating synthetic training examples following this error distribution. We use the generated data for training a character-level neural seq2seq model and evaluate the performance of the suggested model on a manually corrected corpus of Finnish newspapers mostly from the 19th century. The results show that a clear improvement over the underlying OCR system as well as previously suggested models utilizing uniformly generated noise can be achieved.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that OCR error distributions can be estimated from repeating text spans in large raw OCR corpora to generate synthetic training pairs, which are then used to train a character-level seq2seq denoising autoencoder for post-correction. This model is evaluated on a manually corrected corpus of mostly 19th-century Finnish newspapers and reported to outperform both the baseline OCR output and seq2seq models trained with uniformly generated noise, thereby reducing the need for manually aligned training data.

Significance. If the repeat-derived error model generalizes to the target domain, the method offers a practical way to bootstrap post-correction systems for historical texts using only large uncurated OCR collections. The approach is empirical and grounded in observable repetitions rather than fitted parameters or self-referential loops, which is a strength. However, the reported gains rest on an untested assumption about error-distribution match, so the significance is conditional on further validation of that point.

major comments (2)
  1. [Experimental results / evaluation section] The central claim requires that the character-level error statistics (substitutions, insertions, deletions, and specific confusions) observed in repeating spans match those in the single-occurrence text of the evaluation corpus. No direct comparison of these distributions against the gold corrections is reported, leaving the representativeness assumption untested and the source of the observed improvement ambiguous.
  2. [Abstract and results] The abstract and results description provide no quantitative details on the number of repeating spans identified, the alignment procedure used to extract error pairs, the total volume of synthetic data generated, or any statistical significance tests on the reported improvements. These omissions make the scale and reliability of the synthetic training regime difficult to assess.
minor comments (2)
  1. [Method] Clarify in the method section whether the seq2seq model is a standard denoising autoencoder or incorporates additional architectural choices (e.g., attention mechanism details) that could affect reproducibility.
  2. [Abstract] The abstract would benefit from reporting concrete metrics such as character error rate reductions rather than the qualitative statement of 'clear improvement'.
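The distribution comparison requested in the first major comment could be approached along these lines. The sketch counts non-matching edit operations between noisy and clean strings with Python's `difflib.SequenceMatcher`, then measures how far two such count tables are apart with the Jensen-Shannon divergence (0 for identical distributions, 1 bit for disjoint ones). The choice of divergence, the function names, and the example pair are assumptions for illustration; the paper does not specify this procedure.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def edit_op_counts(pairs):
    """Count (operation, clean_fragment, noisy_fragment) over (noisy, clean) pairs."""
    counts = Counter()
    for noisy, clean in pairs:
        matcher = SequenceMatcher(None, clean, noisy, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(tag, clean[i1:i2], noisy[j1:j2])] += 1
    return counts


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (in bits) between two count tables."""
    p_total = sum(p_counts.values()) or 1
    q_total = sum(q_counts.values()) or 1
    js = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts[key] / p_total
        q = q_counts[key] / q_total
        m = (p + q) / 2
        if p > 0:
            js += 0.5 * p * math.log2(p / m)
        if q > 0:
            js += 0.5 * q * math.log2(q / m)
    return js


# Invented example pair: (OCR reading, corrected text).
repeat_ops = edit_op_counts([("sauomalehti", "sanomalehti")])
```

A low divergence between the table built from repeating spans and the table built from the gold-corrected evaluation corpus would support the representativeness assumption; a high one would suggest the gains come from elsewhere.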

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major comment below and will revise the manuscript to improve clarity and strengthen the evidence presented.

  1. Referee: [Experimental results / evaluation section] The central claim requires that the character-level error statistics (substitutions, insertions, deletions, and specific confusions) observed in repeating spans match those in the single-occurrence text of the evaluation corpus. No direct comparison of these distributions against the gold corrections is reported, leaving the representativeness assumption untested and the source of the observed improvement ambiguous.

    Authors: We agree that a direct comparison of error distributions would provide stronger support for the assumption that repeat-derived errors are representative of the target domain. The performance advantage over uniform-noise baselines offers indirect evidence that the estimated distribution is more suitable, but this does not constitute a direct test. In the revised version we will add an explicit comparison of substitution, insertion, deletion, and character-confusion frequencies extracted from the repeating spans against the same statistics computed from the gold corrections in the evaluation corpus. revision: yes

  2. Referee: [Abstract and results] The abstract and results description provide no quantitative details on the number of repeating spans identified, the alignment procedure used to extract error pairs, the total volume of synthetic data generated, or any statistical significance tests on the reported improvements. These omissions make the scale and reliability of the synthetic training regime difficult to assess.

    Authors: The methods section of the manuscript describes the alignment procedure used to extract error pairs from repeats, and the experimental section reports the overall scale of the generated data. However, we acknowledge that additional quantitative detail and formal significance testing would make the experimental regime easier to evaluate. We will expand both the abstract and the results section to report the number of repeating spans identified, the total number of synthetic training pairs produced, and the outcomes of statistical significance tests on the observed improvements. revision: yes
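For the significance testing mentioned in the second response, one common option is a paired bootstrap over evaluation lines. The sketch below is an assumption about how such a test might be run, not the authors' stated procedure: it resamples lines with replacement and reports the fraction of resamples in which system A's aggregate character error rate is not lower than system B's. All numbers in the example are made up.

```python
import random


def paired_bootstrap(errors_a, errors_b, lengths, n_resamples=1000, seed=0):
    """Fraction of resamples where A's aggregate CER is not lower than B's.

    errors_a / errors_b: per-line edit distances of each system to the gold text.
    lengths: per-line gold lengths in characters (all assumed positive).
    """
    rng = random.Random(seed)
    n = len(lengths)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total = sum(lengths[i] for i in idx)
        cer_a = sum(errors_a[i] for i in idx) / total
        cer_b = sum(errors_b[i] for i in idx) / total
        if cer_a >= cer_b:
            not_better += 1
    return not_better / n_resamples


# Made-up per-line error counts for two hypothetical systems.
lengths = [40, 55, 38, 61, 47]
repeat_model_errors = [1, 2, 0, 3, 1]
uniform_model_errors = [3, 4, 2, 5, 2]
p_value = paired_bootstrap(repeat_model_errors, uniform_model_errors, lengths)
```

A small returned value indicates that the advantage of system A is stable under resampling of the test lines.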

Circularity Check

0 steps flagged

No circularity: empirical estimation from external repetitions

The paper estimates OCR error distributions directly from observable differences in repeating spans within large raw corpora, generates synthetic pairs, trains a seq2seq model, and evaluates on an independent manually corrected test set. No equations, parameters, or claims reduce to their own inputs by construction; the representativeness assumption is an empirical hypothesis tested via performance gains over uniform-noise baselines, not a self-definition or self-citation loop. The derivation chain is data-driven and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that repetition-derived errors generalize and on standard neural seq2seq training; no new entities are postulated.

axioms (1)
  • Domain assumption: Repeating text spans in the raw OCR corpus exhibit the same error distribution as the target evaluation corpus.
    Invoked when generating synthetic training data from repetitions and applying the model to the newspaper test set.

pith-pipeline@v0.9.0 · 5667 in / 1282 out tokens · 33321 ms · 2026-05-25T16:05:15.043173+00:00 · methodology

