iDocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents
Pith reviewed 2026-05-10 08:29 UTC · model grok-4.3
The pith
Self-supervised iDoc encoder with open-set detection speeds pattern spotting by 10x and raises precision on small non-square queries to 0.612
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that their iDocV2 model, which pairs a self-supervised iDoc encoder with an open-set detector and non-maximum suppression, delivers competitive performance in pattern spotting and document retrieval while achieving a 10x speed improvement and a new state-of-the-art precision of 0.612 on small non-square queries.
What carries the argument
The iDoc encoder trained with self-supervision, integrated with open-set detection and non-maximum suppression to enable efficient matching of patterns in document images without exhaustive dense searches.
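The paper's pipeline is not detailed on this page, but the general pattern it describes (a detector proposes candidate regions, and encoder embeddings are ranked by similarity to the query instead of a dense sliding-window search) can be sketched minimally. Everything below is illustrative: the function names (`spot`, `cosine`) and the plain-list embeddings are hypothetical, not the authors' API.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors given as plain lists.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def spot(query_emb, candidates, top_k=5):
    """Rank detector-proposed regions by embedding similarity to the query.

    candidates: list of (region_id, embedding) pairs, e.g. produced by an
    open-set detector plus an encoder. Returns the top_k region ids.
    """
    ranked = sorted(candidates, key=lambda c: cosine(query_emb, c[1]),
                    reverse=True)
    return [rid for rid, _ in ranked[:top_k]]
```

The speed argument follows from this shape: similarity is computed only against a small set of proposed regions rather than every window position in every page, which is where a 10x reduction over dense baselines is plausible.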
If this is right
- Searching in historical document datasets becomes ten times faster, reducing time from seconds to sub-second levels.
- Precision on small non-square queries improves to 0.612, exceeding the previous best of 0.427.
- Overall results stay competitive with state-of-the-art methods in both spotting and retrieval tasks.
- Non-maximum suppression lowers the rate of false positive detections.
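The non-maximum suppression step mentioned above is standard greedy NMS: keep the highest-scoring detection, discard lower-scoring boxes that overlap it beyond an IoU threshold, and repeat. A self-contained sketch (the 0.5 threshold is a conventional default, not a value taken from the paper):

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the best-scoring box,
    drop overlapping lower-scoring ones, repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

This is how NMS lowers the false-positive rate: near-duplicate detections of the same pattern collapse to a single, highest-confidence box.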
Where Pith is reading between the lines
- The self-supervised training strategy could extend to other visual search problems in document analysis where labeled data is scarce.
- Real-time pattern spotting might enable new tools for historians to explore digitized collections interactively.
- Further validation on varied document types would test if the speed and accuracy benefits persist beyond the evaluated dataset.
Load-bearing premise
The combination of the self-supervised iDoc encoder, open-set detection, and non-maximum suppression will produce similar speed and accuracy improvements on other collections of historical documents.
What would settle it
Evaluating the model on an independent historical document dataset and observing either a speed-up below 5x or a precision on small non-square queries below 0.5 would undercut the claim that the reported advances generalize beyond DocExplore.
read the original abstract
Considering the imminent massification of digital books, it has become critical to facilitate searching collections through graphical patterns. Current strategies for document retrieval and pattern spotting in historical documents still need to be improved. State-of-the-art strategies achieve an overall precision of $0.494$ for pattern spotting, where the precision for small non-square queries reaches 0.427. In addition, the processing time is excessive, requiring up to 7 seconds for searching in the DocExplore dataset due to a dense-based strategy used by SOTA models. Therefore, we propose a new model based on a better encoder (iDoc), trained under a self-supervised strategy, and an open-set detector to accelerate searching. Our model achieves competitive results with state-of-the-art pattern spotting and document retrieval, improving speed by 10x. Furthermore, our model reaches a new SOTA performance on the small non-square queries, achieving a new precision of 0.612. Different from the previous version, this leverages non-maximum suppression to reduce false positives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents iDocV2, a model for pattern spotting and document retrieval in historical documents. It employs a self-supervised iDoc encoder together with an open-set detector and non-maximum suppression. On the DocExplore dataset the approach is reported to deliver competitive overall performance with prior SOTA methods, a 10x reduction in processing time relative to dense baselines, and a new SOTA precision of 0.612 on small non-square queries (improving on the prior 0.427).
Significance. If the empirical gains hold under scrutiny, the work offers a practical advance for scalable analysis of large digitized historical collections. The combination of self-supervision and open-set detection yields both accuracy gains on difficult query types and a substantial speed-up, addressing two central limitations of existing dense retrieval pipelines. The emphasis on efficiency without sacrificing precision on small non-square patterns is a useful contribution to document image analysis.
minor comments (2)
- The abstract states that the model achieves 'competitive results' with SOTA but reports only the baseline overall precision (0.494); the exact overall precision attained by iDocV2 should be stated explicitly for direct comparison.
- Section describing the experimental protocol should include a concise summary of training hyperparameters, number of runs, and any statistical testing performed, even if full details appear in supplementary material.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. The summary correctly captures the core contributions of iDocV2: the self-supervised encoder, open-set detection, and non-maximum suppression that together deliver competitive overall precision, a 10x speed-up, and a new state-of-the-art result of 0.612 on small non-square queries.
Circularity Check
No significant circularity
full rationale
The paper describes an empirical ML pipeline: a self-supervised iDoc encoder plus open-set detection and NMS for pattern spotting. All reported numbers (0.612 precision on small non-square queries, 10x speedup on DocExplore) are direct experimental outcomes of training and inference on held-out data, not quantities defined in terms of fitted parameters from the same data or reduced by any equation. No derivations, uniqueness theorems, or ansatzes appear; the central claims rest on standard train/eval comparisons rather than self-referential definitions or load-bearing self-citations.