arxiv: 1906.09513 · v1 · pith:OQTKLTKFnew · submitted 2019-06-22 · 💻 cs.CV

Image Retrieval and Pattern Spotting using Siamese Neural Network

Pith reviewed 2026-05-25 17:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords image retrieval · pattern spotting · siamese neural network · document image analysis · similarity learning · tobacco800 dataset

The pith

A Siamese neural network trained only on natural image pairs can retrieve and spot patterns in document images with high accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that similarity features learned by a Siamese network from ImageNet pairs transfer effectively to document images. This would allow retrieval and pattern spotting in document collections without manual features or document-specific training. The method is evaluated on the Tobacco800 collection, where it achieves strong results against other approaches. Reducing the size of the learned feature maps is also tested for its effect on speed and accuracy.

Core claim

The central claim is that a Siamese Neural Network trained on a subset of image pairs from the ImageNet dataset learns a similarity-based representation. This representation provides feature maps that find relevant document image candidates given a query, leading to 0.94 mAP for retrieval and 0.83 mAP for pattern spotting at IoU=0.7 on the Tobacco800 dataset, outperforming state-of-the-art document image retrieval methods.
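The two headline numbers depend on how IoU and mAP are computed. The sketch below shows one common formulation of both metrics. The function names, the (x1, y1, x2, y2) box layout, and the choice of non-interpolated average precision are assumptions made for illustration; this page does not spell out the paper's exact evaluation protocol.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(ranked_hits, num_relevant):
    """AP for one query; ranked_hits[i] is True if the i-th result is relevant.

    Assumption: non-interpolated AP, normalised by the number of relevant items.
    """
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / num_relevant


def mean_average_precision(per_query):
    """per_query: list of (ranked_hits, num_relevant) tuples, one per query."""
    return sum(average_precision(h, n) for h, n in per_query) / len(per_query)
```

Under this formulation, a spotted region would count as a hit only when its IoU with the ground-truth box reaches the chosen threshold (0.7 in the reported result), and the resulting hit list feeds `average_precision`.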

What carries the argument

Siamese Neural Network trained on image pairs to produce similarity-based feature maps for matching.
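As a rough illustration of the mechanism, the sketch below passes two inputs through one shared set of weights and compares the resulting embeddings. The single linear layer, the Euclidean distance, and the contrastive loss with its margin are all assumptions chosen for brevity; this page does not describe the paper's actual architecture or training loss.

```python
import math


def embed(x, weights):
    """Shared embedding branch: one linear layer followed by ReLU (illustrative only)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def siamese_distance(x1, x2, weights):
    """Both inputs go through the same weights; that sharing defines a Siamese network."""
    return euclidean(embed(x1, weights), embed(x2, weights))


def contrastive_loss(distance, same, margin=1.0):
    """Assumed loss: pull similar pairs together, push dissimilar pairs past the margin."""
    if same:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2
```

Because the two branches share weights, the distance is symmetric in its inputs, and a pair of identical images always maps to distance zero regardless of the weights.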

If this is right

  • The learned features support both whole-image retrieval and localized pattern spotting.
  • Performance holds with varying feature map sizes, trading some accuracy for reduced computation.
  • Manual feature engineering can be replaced by this learned similarity approach in document collections.
  • Results suggest the method applies to public document image datasets without additional adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar transfer might work for other specialized image domains like medical scans or historical archives.
  • The same model could be tested on retrieval tasks outside documents to check cross-domain generality.
  • It raises the question of whether document-specific training data is needed at all for similarity-based matching.

Load-bearing premise

Similarity features learned from ImageNet natural-image pairs transfer directly to document images without further domain adaptation or document-specific training data.

What would settle it

A substantially higher mAP on Tobacco800 when the same network is trained on document image pairs instead of ImageNet pairs would challenge the direct transfer claim, since it would show that document-specific training still matters.

Original abstract

This paper presents a novel approach for image retrieval and pattern spotting in document image collections. The manual feature engineering is avoided by learning a similarity-based representation using a Siamese Neural Network trained on a previously prepared subset of image pairs from the ImageNet dataset. The learned representation is used to provide the similarity-based feature maps used to find relevant image candidates in the data collection given an image query. A robust experimental protocol based on the public Tobacco800 document image collection shows that the proposed method compares favorably against state-of-the-art document image retrieval methods, reaching 0.94 and 0.83 of mean average precision (mAP) for retrieval and pattern spotting (IoU=0.7), respectively. Besides, we have evaluated the proposed method considering feature maps of different sizes, showing the impact of reducing the number of features in the retrieval performance and time-consuming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that a Siamese Neural Network trained solely on pairs from the ImageNet dataset can learn transferable similarity features for image retrieval and pattern spotting on document images. Using the public Tobacco800 collection, it reports mean average precision of 0.94 for retrieval and 0.83 for pattern spotting (at IoU=0.7), states that these results compare favorably to prior document-specific methods, and examines the effect of reducing feature-map dimensionality on accuracy and runtime.

Significance. If the reported mAP numbers are reproducible and the domain transfer holds, the work would show that natural-image embeddings can be applied off-the-shelf to document retrieval, removing the need for manual features or document-specific training data and thereby simplifying pipelines for large archival collections.

major comments (2)
  1. [Abstract] Abstract: the headline mAP figures (0.94 retrieval, 0.83 spotting) and the claim of favorable comparison to state-of-the-art document methods are presented without any description of network architecture, training protocol on ImageNet pairs, baseline re-implementations, or statistical significance tests, so the data-to-claim link cannot be verified.
  2. [Abstract] Abstract / evaluation protocol: the central assumption that similarity features learned from ImageNet natural-image pairs transfer directly to Tobacco800 documents without domain adaptation or document-specific fine-tuning is not tested by any ablation, feature-alignment analysis, or cross-domain experiment; this premise is load-bearing for the claim that the method outperforms document-tuned baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate. The full manuscript provides the requested methodological details in the body text; the abstract is a high-level summary.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline mAP figures (0.94 retrieval, 0.83 spotting) and the claim of favorable comparison to state-of-the-art document methods are presented without any description of network architecture, training protocol on ImageNet pairs, baseline re-implementations, or statistical significance tests, so the data-to-claim link cannot be verified.

    Authors: The abstract is intentionally concise. Network architecture (Siamese backbone), ImageNet pair preparation and training protocol, baseline re-implementations, and experimental comparisons are fully described in Sections 3–5 of the manuscript. We will revise the abstract to include a short clause referencing the Siamese architecture and ImageNet-only training to strengthen the data-to-claim linkage at the summary level.

    Revision: partial

  2. Referee: [Abstract] Abstract / evaluation protocol: the central assumption that similarity features learned from ImageNet natural-image pairs transfer directly to Tobacco800 documents without domain adaptation or document-specific fine-tuning is not tested by any ablation, feature-alignment analysis, or cross-domain experiment; this premise is load-bearing for the claim that the method outperforms document-tuned baselines.

    Authors: The manuscript's central experiment is exactly this direct transfer test: a model trained exclusively on ImageNet pairs is evaluated on Tobacco800 without any document fine-tuning or adaptation, and it outperforms prior document-specific methods. This constitutes the cross-domain evidence. While an explicit ablation comparing an ImageNet-trained model against a Tobacco800-trained counterpart is absent, the reported results already isolate the transfer benefit. We will add a dedicated discussion paragraph on domain transfer implications.

    Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical transfer evaluated on external public benchmark

Full rationale

The paper trains a standard Siamese network on ImageNet image pairs and applies the resulting embeddings to the independent Tobacco800 document collection for retrieval and pattern spotting, reporting mAP against external SOTA baselines. No equations, fitted parameters, or self-citations are presented that reduce the headline mAP figures (0.94/0.83) to definitions or inputs of the same quantities by construction. The derivation chain consists of off-the-shelf network training followed by direct feature extraction and ranking on a held-out public dataset; the domain-transfer assumption is an empirical claim open to falsification rather than a self-referential reduction.
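The extract-then-rank step in that chain can be sketched as follows. Cosine similarity, the dictionary of candidate vectors, and the function names are assumptions for illustration; this page does not state which similarity measure or data layout the paper uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates):
    """candidates: dict mapping a candidate id to its feature vector.

    Returns candidate ids sorted from most to least similar to the query.
    """
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]
```

Nothing in this step is fitted to the test collection, which is the property the circularity check relies on: the ranking depends only on features produced by a network trained elsewhere.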

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on transfer of similarity features from natural images to documents plus the assumption that mAP on Tobacco800 reflects real retrieval utility.

free parameters (1)
  • feature-map dimensionality
    Paper varies this size and reports impact on performance and speed, indicating it is chosen rather than derived.
axioms (1)
  • domain assumption: Features learned on ImageNet pairs generalize to document images for retrieval.
    Invoked by training exclusively on ImageNet pairs then testing on Tobacco800 without domain adaptation.
