Leveraging Text Repetitions and Denoising Autoencoders in OCR Post-correction
Pith reviewed 2026-05-25 16:05 UTC · model grok-4.3
The pith
Repeating spans in raw OCR corpora yield error distributions that train a seq2seq model outperforming uniform-noise baselines on historical Finnish newspapers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A character-level seq2seq model trained on synthetic data whose error distribution is estimated from repeating text spans outperforms both the baseline OCR and models trained with uniformly generated noise on a manually corrected Finnish newspaper corpus.
What carries the argument
Mining repeating text spans inside a large raw OCR corpus to estimate the real error distribution and then generate aligned synthetic training pairs for a denoising autoencoder.
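As a rough illustration of this step (not the authors' implementation), the sketch below aligns two OCR readings of the same repeated span, counts character-level edits, and then samples noise from those counts. The function names, the toy Finnish strings, and the use of Python's difflib as the aligner are all assumptions made for the example; the review does not say which aligner the paper uses.

```python
import random
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def count_edits(reference, variant, counts=None):
    """Count character-level edits between two readings of one repeated span.

    Keys are (source, target) pairs; "" marks a deleted or inserted character.
    Assumption: `reference` is treated as the cleaner of the two readings.
    """
    counts = Counter() if counts is None else counts
    ops = SequenceMatcher(None, reference, variant, autojunk=False).get_opcodes()
    for _tag, i1, i2, j1, j2 in ops:
        src, tgt = reference[i1:i2], variant[j1:j2]
        n = min(len(src), len(tgt))
        for a, b in zip(src, tgt):   # matches and substitutions
            counts[(a, b)] += 1
        for a in src[n:]:            # characters missing from the variant
            counts[(a, "")] += 1
        for b in tgt[n:]:            # characters only present in the variant
            counts[("", b)] += 1
    return counts


def build_channel(counts):
    """Group counts by source character so outcomes can be sampled.

    Simplification: insertions (empty source) are counted but not sampled.
    """
    channel = defaultdict(lambda: ([], []))
    for (src, tgt), c in counts.items():
        if src:
            outcomes, weights = channel[src]
            outcomes.append(tgt)
            weights.append(c)
    return channel


def add_noise(clean, channel, rng):
    """Corrupt clean text by sampling each character's observed outcomes."""
    out = []
    for ch in clean:
        if ch in channel:
            outcomes, weights = channel[ch]
            out.append(rng.choices(outcomes, weights=weights, k=1)[0])
        else:
            out.append(ch)           # unseen characters pass through unchanged
    return "".join(out)


# Toy usage: one repeated span read two ways by the OCR system.
counts = count_edits("sanomalehti", "sanomalehtl")
channel = build_channel(counts)
noisy = add_noise("sanomalehti", channel, random.Random(0))
```

Each (clean, noisy) pair produced this way would serve as one training example for the denoising model, with the noisy string as input and the clean string as target.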
If this is right
- The approach eliminates the need to produce manually corrected training data for the post-correction model.
- The resulting model improves token and character accuracy over the underlying OCR system on the target corpus.
- Synthetic data built from observed repeats outperforms synthetic data built from uniform random noise.
- The method is demonstrated on 19th-century Finnish newspaper text but relies only on the existence of repeated spans in the raw corpus.
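For contrast, a uniform-noise baseline of the kind the third point refers to could look like the sketch below. The exact recipe used by the baseline models is not given in this review, so the corruption rate, the three equally likely operations, and the name `uniform_noise` are illustrative assumptions.

```python
import random


def uniform_noise(clean, alphabet, rate, rng):
    """Corrupt `clean` with uniformly chosen edits.

    Each character is corrupted with probability `rate`; the operation
    (substitute, delete, insert) and any new character are drawn uniformly,
    with no knowledge of which confusions an OCR engine actually makes.
    """
    out = []
    for ch in clean:
        if rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))
        # "delete" appends nothing
    return "".join(out)


noisy = uniform_noise("sanomalehti", "abcdefghijklmnopqrstuvwxyzäö", 0.1, random.Random(0))
```

The difference from the repeat-derived approach is that every substitution here is equally likely, whereas an estimated error distribution would concentrate probability on the confusions actually seen in the corpus.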
Where Pith is reading between the lines
- The same repeat-mining step could be applied to other languages or document genres that contain frequent repeated phrases.
- If repeats are scarce in a given corpus, the error estimates may become unreliable and performance may fall back toward uniform-noise levels.
- Iterating the process by using the model’s own corrections to find new repeats could further refine the error distribution.
- The technique might be combined with language-model rescoring to handle cases where the seq2seq model alone leaves residual errors.
Load-bearing premise
The error distribution observed in repeating spans within the large raw OCR corpus is representative of the error distribution present in the target evaluation corpus of 19th-century Finnish newspapers.
What would settle it
If the repeat-derived model shows no accuracy gain over a uniform-noise model when both are evaluated on the manually corrected test set, the central claim would be falsified.
Original abstract
A common approach for improving OCR quality is a post-processing step based on models correcting misdetected characters and tokens. These models are typically trained on aligned pairs of OCR read text and their manually corrected counterparts. In this paper we show that the requirement of manually corrected training data can be alleviated by estimating the OCR errors from repeating text spans found in large OCR read text corpora and generating synthetic training examples following this error distribution. We use the generated data for training a character-level neural seq2seq model and evaluate the performance of the suggested model on a manually corrected corpus of Finnish newspapers mostly from the 19th century. The results show that a clear improvement over the underlying OCR system as well as previously suggested models utilizing uniformly generated noise can be achieved.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that OCR error distributions can be estimated from repeating text spans in large raw OCR corpora to generate synthetic training pairs, which are then used to train a character-level seq2seq denoising autoencoder for post-correction. This model is evaluated on a manually corrected corpus of mostly 19th-century Finnish newspapers and reported to outperform both the baseline OCR output and seq2seq models trained with uniformly generated noise, thereby reducing the need for manually aligned training data.
Significance. If the repeat-derived error model generalizes to the target domain, the method offers a practical way to bootstrap post-correction systems for historical texts using only large uncurated OCR collections. The approach is empirical and grounded in observable repetitions rather than fitted parameters or self-referential loops, which is a strength. However, the reported gains rest on an untested assumption about error-distribution match, so the significance is conditional on further validation of that point.
major comments (2)
- [Experimental results / evaluation section] The central claim requires that the character-level error statistics (substitutions, insertions, deletions, and specific confusions) observed in repeating spans match those in the single-occurrence text of the evaluation corpus. No direct comparison of these distributions against the gold corrections is reported, leaving the representativeness assumption untested and the source of the observed improvement ambiguous.
- [Abstract and results] The abstract and results description provide no quantitative details on the number of repeating spans identified, the alignment procedure used to extract error pairs, the total volume of synthetic data generated, or any statistical significance tests on the reported improvements. These omissions make the scale and reliability of the synthetic training regime difficult to assess.
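One way the distribution comparison requested in the first major comment could be carried out is sketched below: compute the Jensen-Shannon divergence between the edit counts mined from repeats and the edit counts derived from the gold corrections. The choice of this particular measure, the function names, and the toy counts are assumptions for illustration; the referee does not prescribe a metric.

```python
import math
from collections import Counter


def normalize(counts):
    """Convert a non-empty table of counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence in bits between two count tables.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no events in common.
    """
    p, q = normalize(p_counts), normalize(q_counts)
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


# Hypothetical confusion counts, keyed by (source, target) character pairs.
from_repeats = Counter({("i", "l"): 30, ("a", "ä"): 10, ("e", "c"): 5})
from_gold = Counter({("i", "l"): 25, ("a", "ä"): 12, ("u", "n"): 8})
score = js_divergence(from_repeats, from_gold)
```

A low score would support the representativeness assumption; a high score would suggest the gains over uniform noise come from something other than a well-matched error model.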
minor comments (2)
- [Method] Clarify in the method section whether the seq2seq model is a standard denoising autoencoder or incorporates additional architectural choices (e.g., attention mechanism details) that could affect reproducibility.
- [Abstract] The abstract would benefit from reporting concrete metrics such as character error rate reductions rather than the qualitative statement of 'clear improvement'.
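For reference, character error rate (CER) is commonly defined as the Levenshtein distance between a system output and the gold text, divided by the length of the gold text. The sketch below is a minimal pure-Python version; the function names and example strings are illustrative, and the paper may compute its metrics differently.

```python
def levenshtein(a, b):
    """Edit distance between a and b with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of `hypothesis` against the gold `reference`."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_reduction(cer_before, cer_after):
    """Fraction of the original error rate removed by post-correction."""
    return (cer_before - cer_after) / cer_before


ocr_cer = cer("sanomalehtl", "sanomalehti")        # raw OCR output vs gold
corrected_cer = cer("sanomalehti", "sanomalehti")  # post-corrected vs gold
```

Reporting `relative_reduction(ocr_cer, corrected_cer)` alongside the absolute rates would give the concrete figure the referee asks for.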
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major comment below and will revise the manuscript to improve clarity and strengthen the evidence presented.
- Referee: [Experimental results / evaluation section] The central claim requires that the character-level error statistics (substitutions, insertions, deletions, and specific confusions) observed in repeating spans match those in the single-occurrence text of the evaluation corpus. No direct comparison of these distributions against the gold corrections is reported, leaving the representativeness assumption untested and the source of the observed improvement ambiguous.
  Authors: We agree that a direct comparison of error distributions would provide stronger support for the assumption that repeat-derived errors are representative of the target domain. The performance advantage over uniform-noise baselines offers indirect evidence that the estimated distribution is more suitable, but this does not constitute a direct test. In the revised version we will add an explicit comparison of substitution, insertion, deletion, and character-confusion frequencies extracted from the repeating spans against the same statistics computed from the gold corrections in the evaluation corpus.
  Revision: yes
- Referee: [Abstract and results] The abstract and results description provide no quantitative details on the number of repeating spans identified, the alignment procedure used to extract error pairs, the total volume of synthetic data generated, or any statistical significance tests on the reported improvements. These omissions make the scale and reliability of the synthetic training regime difficult to assess.
  Authors: The methods section of the manuscript describes the alignment procedure used to extract error pairs from repeats, and the experimental section reports the overall scale of the generated data. However, we acknowledge that additional quantitative detail and formal significance testing would make the experimental regime easier to evaluate. We will expand both the abstract and the results section to report the number of repeating spans identified, the total number of synthetic training pairs produced, and the outcomes of statistical significance tests on the observed improvements.
  Revision: yes
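The significance testing promised above could take several forms. One common option for comparing two systems on the same test segments is a paired bootstrap, sketched below under the assumption that per-segment edit distances are available for both systems. The function name, the resample count, and the toy error counts are hypothetical, and nothing in the review indicates which test the authors will actually use.

```python
import random


def paired_bootstrap(errors_a, errors_b, n_resamples=10000, seed=0):
    """Fraction of resamples in which system B is NOT better than system A.

    `errors_a` and `errors_b` hold edit distances to the gold text for the
    same segments, in the same order. Segments are resampled with
    replacement; a small return value suggests B's advantage is robust.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(errors_a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(errors_a[i] for i in idx)
        total_b = sum(errors_b[i] for i in idx)
        if total_b >= total_a:
            not_better += 1
    return not_better / n_resamples


# Hypothetical per-segment edit distances for two post-correction models.
uniform_noise_model = [5, 4, 6, 7, 3, 5]
repeat_derived_model = [3, 4, 2, 5, 3, 1]
p_value = paired_bootstrap(uniform_noise_model, repeat_derived_model, n_resamples=2000)
```

Resampling whole segments keeps the pairing between the two systems intact, which matters because difficult segments tend to be difficult for both models.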
Circularity Check
No circularity: empirical estimation from external repetitions
The paper estimates OCR error distributions directly from observable differences in repeating spans within large raw corpora, generates synthetic pairs, trains a seq2seq model, and evaluates on an independent manually corrected test set. No equations, parameters, or claims reduce to their own inputs by construction; the representativeness assumption is an empirical hypothesis tested via performance gains over uniform-noise baselines, not a self-definition or self-citation loop. The derivation chain is data-driven and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Repeating text spans in the raw OCR corpus exhibit the same error distribution as the target evaluation corpus.