Image Retrieval and Pattern Spotting using Siamese Neural Network
Pith reviewed 2026-05-25 17:49 UTC · model grok-4.3
The pith
A Siamese neural network trained only on natural image pairs can retrieve document images and spot patterns within them with high accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Siamese Neural Network trained on a subset of image pairs from the ImageNet dataset learns a similarity-based representation. This representation provides feature maps that find relevant document image candidates given a query, leading to 0.94 mAP for retrieval and 0.83 mAP for pattern spotting at IoU=0.7 on the Tobacco800 dataset, outperforming state-of-the-art document image retrieval methods.
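The mAP figures quoted above can be made concrete with a small sketch of mean average precision. This is a generic illustration, not the paper's evaluation code: the function names are hypothetical, and it assumes each ranked list contains every relevant item for its query, so AP is normalized by the number of relevant items found in the list.

```python
def average_precision(ranked_relevance):
    """AP for one query.

    ranked_relevance: list of booleans, best-ranked candidate first,
    True where the candidate is relevant to the query.
    Assumption: the list holds all relevant items for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query AP values (0.0 for an empty query set)."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, a ranking with relevant items at positions 1 and 3 gives AP = (1/1 + 2/3) / 2, and the mAP is the plain average of such values over all queries.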
What carries the argument
A Siamese Neural Network trained on ImageNet image pairs, which produces similarity-based feature maps for matching a query against document image candidates.
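The defining property of a Siamese setup is that both inputs pass through the same branch with shared weights, and the resulting embeddings are compared with a similarity measure. The sketch below illustrates only that idea. The one-layer `embed` branch, the use of cosine similarity, and all names are assumptions made for illustration; the summary above does not specify the paper's architecture or similarity function.

```python
import math


def embed(image, weights):
    """Toy shared branch: one linear layer followed by ReLU.

    image: flat list of floats; weights: list of rows (hypothetical).
    The same weights are used for the query and for every candidate.
    """
    return [max(0.0, sum(w * x for w, x in zip(row, image))) for row in weights]


def cosine_similarity(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates, weights):
    """Return candidate indices sorted from most to least similar."""
    q = embed(query, weights)
    scored = [(cosine_similarity(q, embed(c, weights)), i)
              for i, c in enumerate(candidates)]
    # Sort by descending similarity; ties broken by original index.
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

With identity weights, `rank_candidates([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` puts the candidate pointing in the same direction as the query first.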
If this is right
- The learned features support both whole-image retrieval and localized pattern spotting.
- Performance holds with varying feature map sizes, trading some accuracy for reduced computation.
- Manual feature engineering can be replaced by this learned similarity approach in document collections.
- Results on Tobacco800 suggest the method may apply to other public document image datasets without additional adaptation.
Where Pith is reading between the lines
- Similar transfer might work for other specialized image domains like medical scans or historical archives.
- The same model could be tested on retrieval tasks outside documents to check cross-domain generality.
- It raises the question of whether document-specific training data is needed at all for similarity-based matching.
Load-bearing premise
Similarity features learned from ImageNet natural-image pairs transfer directly to document images without further domain adaptation or document-specific training data.
What would settle it
Substantially higher mAP on Tobacco800 when the same network is trained on document image pairs instead of ImageNet pairs would challenge the claim that direct transfer needs no domain adaptation.
Original abstract
This paper presents a novel approach for image retrieval and pattern spotting in document image collections. The manual feature engineering is avoided by learning a similarity-based representation using a Siamese Neural Network trained on a previously prepared subset of image pairs from the ImageNet dataset. The learned representation is used to provide the similarity-based feature maps used to find relevant image candidates in the data collection given an image query. A robust experimental protocol based on the public Tobacco800 document image collection shows that the proposed method compares favorably against state-of-the-art document image retrieval methods, reaching 0.94 and 0.83 of mean average precision (mAP) for retrieval and pattern spotting (IoU=0.7), respectively. Besides, we have evaluated the proposed method considering feature maps of different sizes, showing the impact of reducing the number of features in the retrieval performance and time-consuming.
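The pattern-spotting score in the abstract is reported at IoU=0.7, meaning a predicted region counts as correct only if it overlaps the ground-truth region sufficiently. A minimal sketch of that criterion follows. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the use of `>=` at the threshold are assumptions; the abstract does not state these details.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(predicted, ground_truth, threshold=0.7):
    """True if the predicted box overlaps the ground truth enough."""
    return iou(predicted, ground_truth) >= threshold
```

Under this criterion a prediction covering the top 80% of a ground-truth box passes (IoU 0.8), while one shifted sideways by half its width fails (IoU 1/3).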
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a Siamese Neural Network trained solely on pairs from the ImageNet dataset can learn transferable similarity features for image retrieval and pattern spotting on document images. Using the public Tobacco800 collection, it reports mean average precision of 0.94 for retrieval and 0.83 for pattern spotting (at IoU=0.7), states that these results compare favorably to prior document-specific methods, and examines the effect of reducing feature-map dimensionality on accuracy and runtime.
Significance. If the reported mAP numbers are reproducible and the domain transfer holds, the work would show that natural-image embeddings can be applied off-the-shelf to document retrieval, removing the need for manual features or document-specific training data and thereby simplifying pipelines for large archival collections.
major comments (2)
- [Abstract] Abstract: the headline mAP figures (0.94 retrieval, 0.83 spotting) and the claim of favorable comparison to state-of-the-art document methods are presented without any description of network architecture, training protocol on ImageNet pairs, baseline re-implementations, or statistical significance tests, so the data-to-claim link cannot be verified.
- [Abstract] Abstract / evaluation protocol: the central assumption that similarity features learned from ImageNet natural-image pairs transfer directly to Tobacco800 documents without domain adaptation or document-specific fine-tuning is not tested by any ablation, feature-alignment analysis, or cross-domain experiment; this premise is load-bearing for the claim that the method outperforms document-tuned baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate. The full manuscript provides the requested methodological details in the body text; the abstract is a high-level summary.
- Referee: [Abstract] Abstract: the headline mAP figures (0.94 retrieval, 0.83 spotting) and the claim of favorable comparison to state-of-the-art document methods are presented without any description of network architecture, training protocol on ImageNet pairs, baseline re-implementations, or statistical significance tests, so the data-to-claim link cannot be verified.
  Authors: The abstract is intentionally concise. Network architecture (Siamese backbone), ImageNet pair preparation and training protocol, baseline re-implementations, and experimental comparisons are fully described in Sections 3–5 of the manuscript. We will revise the abstract to include a short clause referencing the Siamese architecture and ImageNet-only training to strengthen the data-to-claim linkage at the summary level.
  Revision: partial
- Referee: [Abstract] Abstract / evaluation protocol: the central assumption that similarity features learned from ImageNet natural-image pairs transfer directly to Tobacco800 documents without domain adaptation or document-specific fine-tuning is not tested by any ablation, feature-alignment analysis, or cross-domain experiment; this premise is load-bearing for the claim that the method outperforms document-tuned baselines.
  Authors: The manuscript's central experiment is exactly this direct transfer test: a model trained exclusively on ImageNet pairs is evaluated on Tobacco800 without any document fine-tuning or adaptation, and it outperforms prior document-specific methods. This constitutes the cross-domain evidence. While an explicit ablation comparing an ImageNet-trained model against a Tobacco800-trained counterpart is absent, the reported results already isolate the transfer benefit. We will add a dedicated discussion paragraph on domain transfer implications.
  Revision: partial
Circularity Check
No circularity; empirical transfer evaluated on external public benchmark
The paper trains a standard Siamese network on ImageNet image pairs and applies the resulting embeddings to the independent Tobacco800 document collection for retrieval and pattern spotting, reporting mAP against external SOTA baselines. No equations, fitted parameters, or self-citations are presented that reduce the headline mAP figures (0.94/0.83) to definitions or inputs of the same quantities by construction. The derivation chain consists of off-the-shelf network training followed by direct feature extraction and ranking on a held-out public dataset; the domain-transfer assumption is an empirical claim open to falsification rather than a self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- feature-map dimensionality
axioms (1)
- domain assumption: Features learned on ImageNet pairs generalize to document images for retrieval.