Pattern Spotting in Historical Documents Using Convolutional Models

Caroline Petitjean; Ignacio Úbeda; Jose M. Saavedra; Laurent Heutte; Stéphane Nicolas

arxiv: 1906.08580 · v1 · pith:Q2SYSFDFnew · submitted 2019-06-20 · 💻 cs.CV

Pith reviewed 2026-05-25 19:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords: pattern spotting, historical documents, RetinaNet, multiscale embeddings, convolutional feature extraction, image retrieval, DocExplore dataset, graphical pattern search

The pith

RetinaNet extracts multiscale embeddings that locate graphical patterns in historical documents more accurately and with less storage than prior systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using RetinaNet as a feature extractor to generate multiscale embeddings from regions in historical document images and from image queries, enabling pattern spotting without any predefined class or task-specific training. Experiments on the DocExplore dataset show this approach locates patterns better than the state-of-the-art system and indexes the collection with lower storage cost. The method still fails to retrieve some pages that contain multiple instances of the query pattern. A reader would care because pattern spotting lets scholars search large digitized archives for repeating graphical elements without building a separate model for each one.

Core claim

The central claim is that multiscale embeddings extracted by a RetinaNet model pre-trained on natural images remain sufficiently discriminative to locate occurrences of an arbitrary graphical query object inside historical document images, delivering higher location accuracy and lower indexing storage than the previous best system on the DocExplore dataset.

What carries the argument

Multiscale embeddings produced by RetinaNet acting as a feature extractor for both document regions and queries, followed by similarity search.
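
As a rough illustration of the "embed, then similarity search" idea, the sketch below ranks stored region embeddings against a query embedding. It is not the authors' implementation. The index layout, the `search` function, and the toy three-dimensional vectors are hypothetical. Scoring by a plain dot product and searching only within the query's assigned pyramid level are assumptions taken from the truncated Figure 5 caption excerpt further down this page.

```python
from typing import Dict, List, Tuple

Embedding = List[float]
# (page id, (row, col) of the neuron in the feature map, embedding vector)
IndexEntry = Tuple[str, Tuple[int, int], Embedding]


def dot(a: Embedding, b: Embedding) -> float:
    """Plain dot product, used here as the similarity score (an assumption)."""
    return sum(x * y for x, y in zip(a, b))


def search(index: Dict[str, List[IndexEntry]], query: Embedding,
           level: str, top_k: int = 3) -> List[Tuple[float, str, Tuple[int, int]]]:
    """Rank the embeddings stored for one pyramid level against the query."""
    scored = [(dot(query, emb), page, pos)
              for page, pos, emb in index.get(level, [])]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest score first
    return scored[:top_k]


# Toy index with made-up 3-dimensional embeddings (hypothetical data).
index = {
    "P3": [("page_01", (0, 0), [1.0, 0.0, 0.0]),
           ("page_01", (0, 1), [0.0, 1.0, 0.0]),
           ("page_02", (2, 3), [0.9, 0.1, 0.0])],
    "P4": [("page_02", (1, 1), [0.0, 0.0, 1.0])],
}

# The query is assumed to have been assigned to level P3.
results = search(index, query=[1.0, 0.0, 0.0], level="P3", top_k=2)
for score, page, pos in results:
    print(f"{page} at {pos}: score {score:.2f}")
```

Keeping one list of entries per level means a query touches only the embeddings of its own level, which is one plausible reason a level-restricted search could keep lookups cheap.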

If this is right

Higher accuracy at locating single-instance patterns than the prior system on DocExplore.
Lower storage cost for indexing the full document collection.
Failure to retrieve some pages containing multiple instances of the same query.
No need for class labels or per-pattern training to perform the search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The transfer from natural-image pre-training suggests similar embeddings could be tried on other non-photographic domains such as diagrams or maps.
Storage savings could allow indexing of much larger archives than current methods permit.
The multiple-instance failure case points to a possible next step of combining the embeddings with a lightweight detection head.
The approach could be tested on other historical document collections to check whether the accuracy gain holds beyond DocExplore.

Load-bearing premise

Embeddings from a model trained on natural images stay discriminative for arbitrary graphical patterns in historical documents without any task-specific adaptation or class labels.

What would settle it

A direct comparison on the DocExplore dataset in which the RetinaNet embedding method fails to exceed the state-of-the-art system in pattern location accuracy or exceeds it in required storage.

Figures

Figures reproduced from arXiv: 1906.08580 by Caroline Petitjean, Ignacio Úbeda, Jose M. Saavedra, Laurent Heutte, Stéphane Nicolas.

**Figure 2.** Sub-pages example. All sub-pages have the same shape (103 × 103); overlapping may exist between them.
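
A minimal sketch of how a page could be cut into equally sized sub-pages that may overlap, as the Figure 2 caption describes. The caption does not say how the authors place the sub-pages, so the `tile_starts` and `sub_pages` helpers, the stride parameter, and the "shift the last tile back to the border" rule are all assumptions made for illustration.

```python
from typing import List, Tuple


def tile_starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets of fixed-size tiles along one axis.

    The last tile is shifted back so that it ends at the image border,
    which is one way overlap between neighbouring tiles can arise.
    If the axis is shorter than the tile, a single tile starting at 0 is
    returned and the caller would need to pad the image.
    """
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts


def sub_pages(width: int, height: int, tile: int,
              stride: int) -> List[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) boxes, all of shape tile x tile."""
    return [(x, y, x + tile, y + tile)
            for y in tile_starts(height, tile, stride)
            for x in tile_starts(width, tile, stride)]


# Hypothetical page of 250 x 100 pixels cut into 100 x 100 sub-pages.
boxes = sub_pages(width=250, height=100, tile=100, stride=100)
for box in boxes:
    print(box)
```

With these numbers the third box starts at x = 150, so it overlaps the second one by 50 pixels while keeping the same shape as the others.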

**Figure 3.** Neuron RF across pyramid levels example. The left corresponds to P3 while the right to P4. Note that deeper neurons have bigger RF than shallow neurons. Contrary to the pages where we extract a pyramid of feature maps, for the queries we only extract one embedding per each query. As these are centered in the input, we keep the center neuron at level Pk as the embedding for each one. We assign a query of wi…

**Figure 4.** RF centers (red points) for all neurons at P3 level.

**Figure 6.** Bounding box localization for an instance of category “D” (letter “D” as a dropped initial). The red point in the left image corresponds to the RF center of the closest embedding while the right image shows the bounding box localization with the center already translated. 4 Experiments and Evaluation 4.1 Experimental Protocol We make use of the DocExplore dataset [1, 8] to compare our proposal to the stat…
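
The Figure 6 caption describes going from the receptive-field (RF) center of the best-matching embedding to a bounding box. The sketch below shows one way that mapping could look. It assumes RF centers lie on a regular grid with a half-stride offset, uses stride 8 for P3 (the usual value in a feature pyramid network), and places a box of the query's size around the center. The center translation mentioned in the caption is not specified on this page and is not modelled. Both function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def rf_center(row: int, col: int, stride: int) -> Tuple[float, float]:
    """Approximate (x, y) pixel position of a neuron's RF center.

    Assumes centers sit on a regular grid with spacing `stride`,
    offset by half a stride from the image origin.
    """
    return (col * stride + stride / 2.0, row * stride + stride / 2.0)


def box_around(center: Tuple[float, float], width: float, height: float,
               page_w: int, page_h: int) -> Box:
    """Box of the query's size centered on `center`, clipped to the page."""
    cx, cy = center
    x0 = max(0.0, cx - width / 2.0)
    y0 = max(0.0, cy - height / 2.0)
    x1 = min(float(page_w), cx + width / 2.0)
    y1 = min(float(page_h), cy + height / 2.0)
    return (x0, y0, x1, y1)


# Hypothetical best match: neuron (row 5, col 10) at level P3, stride 8.
center = rf_center(row=5, col=10, stride=8)
box = box_around(center, width=40, height=20, page_w=1000, page_h=1000)
print(center, box)
```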

**Figure 5.** Example of a label page at level P3 for the NonText classifier. Each point is a RF center of a neuron. Green is for black, red for text and blue for non-text class. 3.5 Query Retrieval and Query Localization For retrieval, queries are searched only at the same level they were assigned to (e.g. if a query was assigned at P4 level, then it is looked for only at the P4 level of the pages). We use dot distance…

**Figure 7.** A page example of the regions predicted by the classifier.

**Figure 9.** Localization example for top 50 retrievals for category “Brace Ornament” (left) and “Corner Diamond” (right). The ground truth is drawn in red, the bounding boxes returned by our system in yellow and by the state-of-the-art system in green. Ranking position of the bounding box retrieval is marked in top-left for our system and top-right for state-of-the-art system. 5 Conclusions In this work, a deep lea…

Original abstract

Pattern spotting consists of searching in a collection of historical document images for occurrences of a graphical object using an image query. Contrary to object detection, no prior information nor predefined class is given about the query so training a model of the object is not feasible. In this paper, a convolutional neural network approach is proposed to tackle this problem. We use RetinaNet as a feature extractor to obtain multiscale embeddings of the regions of the documents and also for the queries. Experiments conducted on the DocExplore dataset show that our proposal is better at locating patterns and requires less storage for indexing images than the state-of-the-art system, but fails at retrieving some pages containing multiple instances of the query.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RetinaNet multiscale embeddings give a modest reported lift on DocExplore pattern spotting with lower storage, but the evaluation is thin and the multiple-instance failure plus domain-shift risk make the gains look provisional.

The one thing to know is that this paper takes a RetinaNet pretrained on COCO, freezes it, and uses its multiscale feature maps to embed both document regions and image queries for pattern spotting without any class labels or task training. On the DocExplore dataset the approach is said to locate patterns better than the prior system while needing less index storage, though it fails to retrieve pages that contain several copies of the query.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes using a RetinaNet model pre-trained on COCO as a frozen feature extractor to generate multiscale embeddings for both document regions and queries in order to perform pattern spotting in historical documents without task-specific training or class information. Experiments on the DocExplore dataset are said to demonstrate better pattern location and lower storage requirements than the state-of-the-art, although the method fails to retrieve some pages containing multiple instances of the query.

Significance. If the empirical claims are substantiated with detailed metrics and ablations, this work could demonstrate the viability of direct transfer of object detection backbones for unsupervised pattern retrieval in degraded historical documents, offering efficiency gains in storage and computation for large collections. The approach avoids the need for labeled data specific to the patterns, which is a practical advantage in digital humanities.

major comments (3)

[Abstract] The claim that the proposal 'is better at locating patterns' supplies no quantitative metrics, error bars, baseline details, or analysis of the noted failure mode with multiple instances, which is load-bearing for the central empirical claim.
[Experiments] No ablation on pretraining source, fine-tuning, or alternative backbones is described to isolate whether gains derive from the transfer assumption or other design choices, leaving the domain-shift concern unaddressed despite the reported failure cases.
[Method] The multiscale embeddings are extracted from a frozen RetinaNet without task-specific adaptation, yet no analysis or quantitative breakdown is given for why this remains discriminative for arbitrary historical patterns exhibiting ink bleed, parchment texture, and degradation.

minor comments (2)

The abstract and main text should include at least one table or figure with concrete performance numbers (e.g., precision@K or storage in MB) to support the comparative claims.
Clarify the exact indexing and matching procedure for the embeddings to make the storage-reduction claim reproducible.
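
For readers unfamiliar with the metrics the referee asks for, the sketch below computes precision@K and average precision for one ranked result list. It uses the standard textbook definitions and a made-up ranking, not the paper's evaluation code or data. The function names are hypothetical.

```python
from typing import List


def precision_at_k(relevant: List[bool], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(relevant[:k]) / k


def average_precision(relevant: List[bool], total_relevant: int) -> float:
    """Mean of precision@i over the ranks i that hold a relevant result.

    `total_relevant` is the number of relevant items in the collection,
    so relevant items missing from the ranking lower the score.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    acc = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            acc += hits / rank
    return acc / total_relevant


# Made-up ranking: results 1 and 3 are correct, 2 and 4 are not.
ranking = [True, False, True, False]
p_at_2 = precision_at_k(ranking, 2)
ap = average_precision(ranking, total_relevant=2)
print(f"precision@2 = {p_at_2:.3f}, AP = {ap:.3f}")
```

Averaging `average_precision` over all queries gives the mean average precision that the simulated rebuttal below mentions.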

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be incorporated to strengthen the paper.

Referee: [Abstract] The claim that the proposal 'is better at locating patterns' supplies no quantitative metrics, error bars, baseline details, or analysis of the noted failure mode with multiple instances, which is load-bearing for the central empirical claim.

Authors: We agree that the abstract would be strengthened by including specific quantitative details. In the revised manuscript, we will update the abstract to report key metrics such as the improvement in mean average precision over the state-of-the-art baseline on DocExplore, the storage reduction factor, and a note on the multiple-instance failure cases observed in the experiments. (revision: yes)

Referee: [Experiments] No ablation on pretraining source, fine-tuning, or alternative backbones is described to isolate whether gains derive from the transfer assumption or other design choices, leaving the domain-shift concern unaddressed despite the reported failure cases.

Authors: This is a fair criticism of the current experiments section. Our design intentionally avoids fine-tuning to demonstrate direct transfer from COCO pre-training. We will add a discussion paragraph addressing the domain-shift issue and the rationale for not performing full ablations (computational focus on the transfer setting), while acknowledging this as a limitation. A partial revision is planned. (revision: partial)

Referee: [Method] The multiscale embeddings are extracted from a frozen RetinaNet without task-specific adaptation, yet no analysis or quantitative breakdown is given for why this remains discriminative for arbitrary historical patterns exhibiting ink bleed, parchment texture, and degradation.

Authors: We will revise the method section to provide the requested analysis. We will explain that the feature pyramid in RetinaNet yields multiscale convolutional features pre-trained on diverse natural images, which capture edge and texture invariants robust to local degradations such as ink bleed and parchment variations; this will be tied to the empirical results on DocExplore without requiring new experiments. (revision: yes)

Circularity Check

0 steps flagged

No circularity: empirical application of public pre-trained model on external dataset

The paper applies RetinaNet (publicly available, pre-trained on COCO) as a frozen multiscale feature extractor for pattern spotting on the external DocExplore dataset, with direct empirical comparison to SOTA on retrieval metrics and storage. No equations, derivations, or parameter-fitting steps are described that reduce to self-defined quantities or self-citations. The central claim is performance improvement via standard transfer learning, which is externally falsifiable on the held-out dataset and does not rely on any load-bearing self-referential definitions or uniqueness theorems from the authors. This is a standard empirical CV application with no mathematical chain that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on transfer-learning assumptions standard in computer vision and on the representativeness of the DocExplore dataset; no new entities or fitted constants are introduced.

axioms (1)

domain assumption: RetinaNet embeddings trained on natural images transfer to historical document patterns without fine-tuning.
Invoked by the choice to use RetinaNet directly as feature extractor for arbitrary queries.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] S. En, S. Nicolas, C. Petitjean, F. Jurie, and L. Heutte. 2016. New public dataset for spotting patterns in medieval document images. Journal of Electronic Imaging 26, 1 (2016), 011010.

[2] S. En, C. Petitjean, S. Nicolas, and L. Heutte. 2016. A scalable pattern spotting system for historical documents. Pattern Recognition 54 (2016), 149–161.

[3] A. Giotis, G. Sfikas, B. Gatos, and C. Nikou. 2017. A Survey of Document Image Word Spotting Techniques. Pattern Recognition 68 (2017), 310–332.

[4] A. Lecoutre, B. Negrevergne, and F. Yger. 2017. Recognizing Art Style Automatically in painting with deep learning. In ACML. 327–342.

[5] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. 2017. Feature Pyramid Networks for Object Detection. In CVPR, Vol. 1. 3.

[6] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. 2018. Focal loss for dense object detection. IEEE Trans. on PAMI (2018).

[7] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. 740–755.

[8] LITIS. 2016. Pattern Spotting in Medieval Document Images. Retrieved June 7, 2019 from http://spotting.univ-rouen.fr/

[9] W. Luo, Y. Li, R. Urtasun, and R. Zemel. 2016. Understanding the effective receptive field in deep convolutional neural networks. In NIPS. 4898–4906.

[10] T. Rakthanmanon, Q. Zhu, and E. Keogh. 2011. Searching Historical Manuscripts for Near-duplicate Figures. In HIP’11, ACM. 14–21.

[11] X. Shen, A. Efros, and M. Aubry. 2019. Discovering Visual Patterns in Art Collections with Spatially-consistent Feature Learning. In CVPR.

[12] S. Sudholt and G. Fink. 2016. PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents. CoRR abs/1604.00187 (2016).

[13] J. Wan, D. Wang, S. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. 2014. Deep learning for content-based image retrieval: A comprehensive study. In 22nd ACM International Conference on Multimedia. ACM, 157–166.

[14] P. Yarlagadda, A. Monroy, B. Carque, and B. Ommer. 2010. Recognition and Analysis of Objects in Medieval Images. In Proc. of ACCV, Workshop on e-Heritage. Springer, 296–305.
