Adversarially Trained Deep Neural Semantic Hashing Scheme for Subjective Search in Fashion Inventory

Debdoot Sheet; Mithun Dasgupta; Saket Singh

arxiv: 1907.00382 · v1 · pith:FUJWMBNGnew · submitted 2019-06-30 · 💻 cs.CV · cs.LG · eess.IV

Pith reviewed 2026-05-25 12:54 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · eess.IV

keywords semantic hashing · adversarial learning · fashion retrieval · deep neural networks · Hamming distance · image search · subjective similarity · convolutional neural network

The pith

An adversarially trained CNN produces semantic hash codes for fashion images that achieve 90.65% mean average precision in subjective retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a hashing method for quickly finding similar fashion items in large inventories by representing images as binary codes. It trains a convolutional neural network to classify clothing types while ensuring that images considered subjectively similar receive hash codes separated by a small Hamming distance and dissimilar ones by a large distance. An adversarial component is added so that, for similar pairs, a discriminator cannot tell which hash code belongs to which image. On fashion inventory search the abstract reports 90.65% mAP, against 53.26% for deep Cauchy hashing. The result matters because direct pixel or feature comparisons are slow and sensitive to variations such as pose and lighting, whereas binary codes can be compared with fast Hamming distance checks.
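To make the retrieval step concrete, here is a minimal sketch of Hamming-radius search over binary codes. The 8-bit codes, the item names, and the `search` helper are hypothetical illustrations; the text does not state the paper's code length or search procedure.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def search(query: int, gallery: dict, radius: int) -> list:
    """Return (item_id, distance) pairs within the radius, nearest first."""
    hits = [(item_id, hamming_distance(query, code))
            for item_id, code in gallery.items()]
    hits = [hit for hit in hits if hit[1] <= radius]
    # Sort by distance, then by id so ties are ordered deterministically.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical 8-bit codes; real codes would come from the trained network.
gallery = {
    "shirt_a": 0b10110010,
    "shirt_b": 0b10110110,
    "dress_a": 0b01001101,
}
results = search(0b10110011, gallery, radius=2)
```

Each comparison is one XOR and a bit count, which is why this scales to large galleries where pixel-space comparison does not.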

Core claim

The central claim is that an adversarially trained deep neural semantic hashing network enables effective subjective search in fashion inventories, with a mean average precision of 90.65%. The network is a CNN trained to (i) minimize clothing type classification error, (ii) minimize Hamming distance between semantic neighbors while maximizing it for dissimilar images, and (iii) maximally scramble a discriminator's ability to identify hash code-image pairs for semantically similar queries.
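The three training signals can be sketched as a weighted sum of per-pair losses. This is an illustration under stated assumptions, not the paper's formulation: the text gives neither the loss forms nor the weights, so the cross-entropy, the contrastive-style pairwise term, the GAN-style adversarial term, and every name and default value below are hypothetical.

```python
import math


def classification_loss(class_probs, label):
    """Cross-entropy for one image: -log of the probability of the true type."""
    return -math.log(class_probs[label])


def pairwise_hash_loss(u, v, similar, margin):
    """Assumed contrastive-style term on relaxed codes with entries in [-1, 1].

    For exact +/-1 codes the squared distance equals 4 * Hamming distance.
    """
    d = sum((a - b) ** 2 for a, b in zip(u, v))
    return d if similar else max(0.0, margin - d)


def adversarial_loss(disc_prob_correct):
    """Assumed GAN-style term: small when the discriminator is fooled.

    Requires disc_prob_correct < 1, otherwise the logarithm is undefined.
    """
    return -math.log(1.0 - disc_prob_correct)


def joint_loss(class_probs, label, u, v, similar, disc_prob_correct,
               w_cls=1.0, w_hash=1.0, w_adv=1.0, margin=16.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_cls * classification_loss(class_probs, label)
            + w_hash * pairwise_hash_loss(u, v, similar, margin)
            + w_adv * adversarial_loss(disc_prob_correct))


# Two hypothetical 4-bit codes that differ in two positions.
u = [1.0, -1.0, 1.0, 1.0]
v = [1.0, 1.0, 1.0, -1.0]
loss = joint_loss([0.5, 0.5], 0, u, v, similar=True, disc_prob_correct=0.5)
```

The relative weights decide which signal dominates the hash codes, which is one reason the referee report asks for the exact weighting.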

What carries the argument

An adversarially trained deep neural semantic hashing network that jointly optimizes classification, semantic Hamming distance, and adversarial discrimination.

Load-bearing premise

The assumption that the combination of clothing type classification, Hamming distance minimization for semantic neighbors, and adversarial discrimination will produce hash codes that reliably place subjective neighbors within a tolerable Hamming radius.

What would settle it

Evaluation on a fashion dataset with independently validated subjective neighbor pairs showing that many such pairs have hash codes exceeding the expected Hamming distance threshold.
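A check of this kind could be sketched as follows, assuming a set of independently validated neighbor pairs and one hash code per image. The pair list, the codes, and the radius are hypothetical placeholders; no such validated set is described in the text.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def fraction_outside_radius(pairs, codes, radius):
    """Share of validated neighbor pairs whose codes are farther apart than radius."""
    if not pairs:
        raise ValueError("need at least one validated pair")
    outside = sum(
        1 for a, b in pairs
        if hamming_distance(codes[a], codes[b]) > radius
    )
    return outside / len(pairs)


# Hypothetical 4-bit codes and human-validated neighbor pairs.
codes = {"q1": 0b1100, "g1": 0b1101, "q2": 0b0011, "g2": 0b1100}
validated_pairs = [("q1", "g1"), ("q2", "g2")]
miss_rate = fraction_outside_radius(validated_pairs, codes, radius=2)
```

A high miss rate on such pairs would count against the load-bearing premise, while a low one would support it.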

Figures

Figures reproduced from arXiv: 1907.00382 by Debdoot Sheet, Mithun Dasgupta, Saket Singh.

**Figure 2.** Figure shows the categorization of dataset.

**Figure 3.** Framework for learning of the deep neural semantic hashing scheme for subjective search across images. Blocks in …

**Figure 4.** Distribution of various classes of clothing items in …

**Figure 5.** An example of images of the same clothing item …

**Figure 6.** Men inventory retrieval result

**Figure 7.** Women Inventory retrieval result

**Figure 8.** Figure shows the relation between hamming dis…

**Figure 9.** The t-SNE visualizations for the proposed architecture and its variants for hash codes generated using MVC dataset

Original abstract

The simple approach of retrieving a closest match of a query image from one in the gallery, compares an image pair using sum of absolute difference in pixel or feature space. The process is computationally expensive, ill-posed to illumination, background composition, pose variation, as well as inefficient to be deployed on gallery sets with more than 1000 elements. Hashing is a faster alternative which involves representing images in reduced dimensional simple feature spaces. Encoding images into binary hash codes enables similarity comparison in an image-pair using the Hamming distance measure. The challenge, however, lies in encoding the images using a semantic hashing scheme that lets subjective neighbors lie within the tolerable Hamming radius. This work presents a solution employing adversarial learning of a deep neural semantic hashing network for fashion inventory retrieval. It consists of a feature extracting convolutional neural network (CNN) learned to (i) minimize error in classifying type of clothing, (ii) minimize hamming distance between semantic neighbors and maximize distance between semantically dissimilar images, (iii) maximally scramble a discriminator's ability to identify the corresponding hash code-image pair when processing a semantically similar query-gallery image pair. Experimental validation for fashion inventory search yields a mean average precision (mAP) of 90.65% in finding the closest match as compared to 53.26% obtained by the prior art of deep Cauchy hashing for hamming space retrieval.
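The mAP figures quoted in the abstract can be read against a small sketch of how mean average precision is commonly computed from ranked retrieval results. The normalization used here (dividing by the number of relevant items retrieved) and the function names are assumptions; the text does not specify the paper's exact evaluation protocol or cutoff.

```python
def average_precision(ranked_relevance):
    """Average of precision@k taken at each rank k that holds a relevant item.

    ranked_relevance is a list of 0/1 flags, best match first.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


# Two hypothetical queries with their ranked relevance flags.
score = mean_average_precision([[1, 0, 1], [0, 1]])
```

Because the relevance flags come from whatever defines a semantic neighbor, the score inherits any weakness in that definition.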

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The reported mAP jump is hard to trust because the paper never spells out how semantic neighbor pairs were built or validated.

The paper combines clothing-type classification, Hamming-distance pull on semantic neighbors, and an adversarial term to make hash codes hard to invert. That joint objective is a straightforward extension of prior deep hashing work, and they report 90.65% mAP against 53.26% for deep Cauchy hashing on a fashion inventory task. The comparison itself is useful to see, and the adversarial component is the clearest addition they highlight. If the full experiments include architecture details, training curves, or at least the dataset size, those would be the parts worth noting for someone replicating the setup.

The central problem is the one flagged in the stress-test note. The Hamming term only works if the positive pairs actually capture subjective similarity, yet the text gives no protocol for obtaining those pairs, no mention of metadata tags versus human ratings, and no check on consistency. Without that, the mAP number could simply reflect leakage from whatever signal was used to label the pairs in the first place. The rest of the method is standard CNN hashing with extra losses; nothing in the equations or setup looks formally novel or parameter-free.

This is the sort of applied retrieval paper that might interest people building e-commerce search tools who already have labeled fashion data. It does not change how we think about hashing or adversarial objectives in general. I would send it to review so the pair-construction question can be asked directly and the experimental details can be inspected, but I would not cite it without those clarifications.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an adversarially trained deep neural semantic hashing scheme for subjective search in fashion inventory retrieval. A CNN is trained to classify clothing types, minimize Hamming distance between semantic neighbors while maximizing it for dissimilar images, and adversarially fool a discriminator on hash code-image pairs for semantically similar queries. The central empirical claim is an mAP of 90.65% for closest-match retrieval, compared to 53.26% for deep Cauchy hashing.

Significance. If the results hold after clarification of the missing components, the work would offer a practical advance in efficient Hamming-space retrieval for subjective similarity in large fashion galleries, where pixel/feature comparison is intractable. The joint objective of classification, Hamming loss, and adversarial training is a coherent design choice that could improve hash code quality over single-objective baselines.

major comments (2)

[Abstract] The headline mAP improvement (90.65% vs 53.26%) rests on the claim that the joint objective places subjective neighbors inside a small Hamming radius, yet the manuscript supplies no protocol for constructing or validating the positive/negative semantic neighbor pairs used in the Hamming loss term (metadata tags, human annotations, clustering, etc.). This definition is load-bearing for interpreting the result as evidence of subjective similarity preservation rather than leakage of the training signal.

[Methods/Experimental section] No information is provided on network architecture, exact loss formulations and weighting for the three objectives, dataset size/splits/characteristics, training procedure, number of runs, or statistical tests. These omissions prevent any assessment of whether the reported mAP gain is reproducible or statistically meaningful.

minor comments (1)

[Abstract] The comparison baseline is referred to only as 'deep Cauchy hashing' without a citation to the specific prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve transparency and reproducibility.

Referee: [Abstract] The headline mAP improvement (90.65% vs 53.26%) rests on the claim that the joint objective places subjective neighbors inside a small Hamming radius, yet the manuscript supplies no protocol for constructing or validating the positive/negative semantic neighbor pairs used in the Hamming loss term (metadata tags, human annotations, clustering, etc.). This definition is load-bearing for interpreting the result as evidence of subjective similarity preservation rather than leakage of the training signal.

Authors: We acknowledge that the manuscript does not currently specify the protocol for constructing or validating semantic neighbor pairs. This is an oversight. In the revision we will add an explicit description of the pair construction method (using available dataset metadata) together with any validation steps, allowing readers to confirm that the reported mAP reflects subjective similarity preservation rather than training-signal leakage.

Revision: yes

Referee: [Methods/Experimental section] No information is provided on network architecture, exact loss formulations and weighting for the three objectives, dataset size/splits/characteristics, training procedure, number of runs, or statistical tests. These omissions prevent any assessment of whether the reported mAP gain is reproducible or statistically meaningful.

Authors: We agree that these details are essential. The revised manuscript will expand the Methods and Experimental sections to include the CNN architecture, the precise mathematical forms and weighting coefficients of the three loss terms, dataset size/splits/characteristics, training hyperparameters and procedure, and results aggregated over multiple runs with appropriate statistical tests.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical mAP is independent experimental measurement

The paper proposes a composite training objective for a CNN (clothing-type classification + Hamming loss on semantic neighbors + adversarial discriminator) and reports an experimental mAP of 90.65% on fashion retrieval, compared against a baseline. This mAP is a measured retrieval metric on held-out data and does not reduce by construction to any fitted parameter, self-definition, or self-citation chain. No equations or claims in the provided text exhibit self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation. The lack of detail on pair construction is an experimental-protocol issue, not a circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the text provided.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 3 internal anchors

[1] Yue Cao, Mingsheng Long, Bin Liu, and Jianmin Wang. 2018. Deep Cauchy Hashing for Hamming Space Retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. 2017. HashNet: Deep Learning to Hash by Continuation. CoRR abs/1702.00758 (2017). arXiv:1702.00758

[3] G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313, 5786 (2006), 504–507. https://doi.org/10.1126/science.1127647 http://science.sciencemag.org/content/313/5786/504.full.pdf

[4] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980

[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'12). Curran Associates Inc., USA, 1097–1105.

[6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1, 4 (Dec. 1989), 541–551. https://doi.org/10.1162/neco.1989.1.4.541

[7] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. 2015. Feature Learning based Deep Supervised Hashing with Pairwise Labels. CoRR abs/1511.03855 (2015). arXiv:1511.03855

[8] K. Lin, H. Yang, J. Hsiao, and C. Chen. 2015. Deep learning of binary hash codes for fast image retrieval. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 27–35. https://doi.org/10.1109/CVPRW.2015.7301269

[9] H. Liu, R. Wang, S. Shan, and X. Chen. 2016. Deep Supervised Hashing for Fast Image Retrieval. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2064–2072. https://doi.org/10.1109/CVPR.2016.227

[10] Kuan-Hsien Liu, Ting-Yen Chen, and Chu-Song Chen. 2016. MVC: A Dataset for View-Invariant Clothing Retrieval and Attribute Prediction. In ICMR.

[11] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y

[12] Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning 50, 7 (2009), 969–978.