iDocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents
Pith reviewed 2026-05-10 08:29 UTC · model grok-4.3
The pith
Self-supervised iDoc encoder with open-set detection speeds pattern spotting by 10x and raises precision on small non-square queries to 0.612
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that their iDocV2 model, which pairs a self-supervised iDoc encoder with an open-set detector and non-maximum suppression, delivers competitive performance in pattern spotting and document retrieval while achieving a 10x speed improvement and a new state-of-the-art precision of 0.612 on small non-square queries.
What carries the argument
The iDoc encoder trained with self-supervision, integrated with open-set detection and non-maximum suppression to enable efficient matching of patterns in document images without exhaustive dense searches.
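The paper's pipeline is not detailed on this page, but the general pattern it describes (a detector proposes candidate regions, and encoder embeddings are ranked by similarity to the query instead of a dense sliding-window search) can be sketched minimally. Everything below is illustrative: the function names (`spot`, `cosine`) and the plain-list embeddings are hypothetical, not the authors' API.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors given as plain lists.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def spot(query_emb, candidates, top_k=5):
    """Rank detector-proposed regions by embedding similarity to the query.

    candidates: list of (region_id, embedding) pairs, e.g. produced by an
    open-set detector plus an encoder. Returns the top_k region ids.
    """
    ranked = sorted(candidates, key=lambda c: cosine(query_emb, c[1]),
                    reverse=True)
    return [rid for rid, _ in ranked[:top_k]]
```

The speed argument follows from this shape: similarity is computed only against a small set of proposed regions rather than every window position in every page, which is where a 10x reduction over dense baselines is plausible.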
If this is right
- Searching in historical document datasets becomes ten times faster, reducing time from seconds to sub-second levels.
- Precision on small non-square queries improves to 0.612, exceeding the previous best of 0.427.
- Overall results stay competitive with state-of-the-art methods in both spotting and retrieval tasks.
- Non-maximum suppression lowers the rate of false positive detections.
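The non-maximum suppression step mentioned above is standard greedy NMS: keep the highest-scoring detection, discard lower-scoring boxes that overlap it beyond an IoU threshold, and repeat. A self-contained sketch (the 0.5 threshold is a conventional default, not a value taken from the paper):

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the best-scoring box,
    drop overlapping lower-scoring ones, repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

This is how NMS lowers the false-positive rate: near-duplicate detections of the same pattern collapse to a single, highest-confidence box.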
Where Pith is reading between the lines
- The self-supervised training strategy could extend to other visual search problems in document analysis where labeled data is scarce.
- Real-time pattern spotting might enable new tools for historians to explore digitized collections interactively.
- Further validation on varied document types would test if the speed and accuracy benefits persist beyond the evaluated dataset.
Load-bearing premise
The combination of the self-supervised iDoc encoder, open-set detection, and non-maximum suppression will produce similar speed and accuracy improvements on other collections of historical documents.
What would settle it
Evaluating the model on an independent historical document dataset and observing either a speed-up below 5x or a precision on small non-square queries below 0.5 would undercut the claim that the reported advances generalize beyond DocExplore.
read the original abstract
Considering the imminent massification of digital books, it has become critical to facilitate searching collections through graphical patterns. Current strategies for document retrieval and pattern spotting in historical documents still need to be improved. State-of-the-art strategies achieve an overall precision of $0.494$ for pattern spotting, where the precision for small non-square queries reaches 0.427. In addition, the processing time is excessive, requiring up to 7 seconds for searching in the DocExplore dataset due to a dense-based strategy used by SOTA models. Therefore, we propose a new model based on a better encoder (iDoc), trained under a self-supervised strategy, and an open-set detector to accelerate searching. Our model achieves competitive results with state-of-the-art pattern spotting and document retrieval, improving speed by 10x. Furthermore, our model reaches a new SOTA performance on the small non-square queries, achieving a new precision of 0.612. Different from the previous version, this leverages non-maximum suppression to reduce false positives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents iDocV2, a model for pattern spotting and document retrieval in historical documents. It employs a self-supervised iDoc encoder together with an open-set detector and non-maximum suppression. On the DocExplore dataset the approach is reported to deliver competitive overall performance with prior SOTA methods, a 10x reduction in processing time relative to dense baselines, and a new SOTA precision of 0.612 on small non-square queries (improving on the prior 0.427).
Significance. If the empirical gains hold under scrutiny, the work offers a practical advance for scalable analysis of large digitized historical collections. The combination of self-supervision and open-set detection yields both accuracy gains on difficult query types and a substantial speed-up, addressing two central limitations of existing dense retrieval pipelines. The emphasis on efficiency without sacrificing precision on small non-square patterns is a useful contribution to document image analysis.
minor comments (2)
- The abstract states that the model achieves 'competitive results' with SOTA but reports only the baseline overall precision (0.494); the exact overall precision attained by iDocV2 should be stated explicitly for direct comparison.
- Section describing the experimental protocol should include a concise summary of training hyperparameters, number of runs, and any statistical testing performed, even if full details appear in supplementary material.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. The summary correctly captures the core contributions of iDocV2: the self-supervised encoder, open-set detection, and non-maximum suppression that together deliver competitive overall precision, a 10x speed-up, and a new state-of-the-art result of 0.612 on small non-square queries.
Circularity Check
No significant circularity
full rationale
The paper describes an empirical ML pipeline: a self-supervised iDoc encoder plus open-set detection and NMS for pattern spotting. All reported numbers (0.612 precision on small non-square queries, 10x speedup on DocExplore) are direct experimental outcomes of training and inference on held-out data, not quantities defined in terms of fitted parameters from the same data or reduced by any equation. No derivations, uniqueness theorems, or ansatzes appear; the central claims rest on standard train/eval comparisons rather than self-referential definitions or load-bearing self-citations.