
arxiv: 1907.00937 · v1 · pith:VUHGIRU2new · submitted 2019-07-01 · 💻 cs.IR · cs.CL

Semantic Product Search

Pith reviewed 2026-05-25 11:27 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords semantic search · product search · deep learning · customer behavior · loss function · e-commerce retrieval · neural matching · information retrieval

The pith

A deep learning model trained on customer behavior logs improves Recall@100 for semantic product search by at least 4.7 percent over prior semantic search baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that lexical search fails on synonyms, morphological variants, and spelling errors in product catalogs, and that a neural model trained directly on customer behavior data can close the gap. It introduces a specialized loss function that separates random negatives from impressed-but-unpurchased items and true purchases, combined with n-gram average pooling and token hashing. These choices produce measurable gains in Recall@100 and mean average precision on offline test sets, with further confirmation from live A/B experiments. The work matters because e-commerce queries are short, noisy, and intent-driven, so better semantic matching directly affects how many relevant products reach customers.
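To make the lexical failure mode concrete, here is a minimal sketch of exact-token matching over a toy inverted index. The catalog, product ids, and function names are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

# Toy catalog; ids and titles are invented for illustration.
CATALOG = {
    "p1": "women running shoes",
    "p2": "sneakers for men",
}

def build_inverted_index(catalog):
    """Map each whitespace token to the set of product ids containing it."""
    index = defaultdict(set)
    for product_id, title in catalog.items():
        for token in title.lower().split():
            index[token].add(product_id)
    return index

def lexical_match(index, query):
    """Return the products that contain every query token exactly."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

index = build_inverted_index(CATALOG)
exact = lexical_match(index, "women shoes")     # both tokens appear in p1
variant = lexical_match(index, "woman shoes")   # morphological variant of "women"
typo = lexical_match(index, "women shoees")     # misspelling of "shoes"
```

Only the exact query finds `p1`; the morphological variant and the misspelling return empty sets, which is the gap a learned semantic matcher is meant to close.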

Core claim

We train a deep learning model for semantic matching in product search using customer behavior data. By developing a new loss function with an inbuilt threshold for random negatives, impressed but unpurchased items, and positives, along with average pooling over n-grams and hashing for out-of-vocabulary tokens, the model achieves at least 4.7% better Recall@100 and 14.5% better MAP than state-of-the-art semantic search baselines using the same tokenization.

What carries the argument

A loss function with an inbuilt threshold that differentiates random negative examples, impressed but not purchased examples, and positive purchased examples.
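One way to picture a thresholded loss over three example classes is a hinge-style penalty with a separate cutoff per class. The sketch below is an assumption about the general shape, not the paper's exact formula: the function name, the label strings, and the three threshold values are hypothetical, and scores are assumed to lie in [0, 1].

```python
def thresholded_hinge_loss(score, label,
                           purchase_floor=0.9,      # hypothetical value
                           impressed_ceiling=0.55,  # hypothetical value
                           random_ceiling=0.2):     # hypothetical value
    """Penalty for one (query, product) similarity score.

    Purchased items are penalized only when they score below the floor.
    Impressed-but-unpurchased and random items are penalized only when
    they score above their respective ceilings.
    """
    if label == "purchased":
        return max(0.0, purchase_floor - score)
    if label == "impressed":
        return max(0.0, score - impressed_ceiling)
    if label == "random":
        return max(0.0, score - random_ceiling)
    raise ValueError(f"unknown label: {label!r}")

# The same mid-range score is treated differently per class.
mid_score = 0.5
losses = {label: thresholded_hinge_loss(mid_score, label)
          for label in ("purchased", "impressed", "random")}
```

With these illustrative cutoffs, a score of 0.5 is acceptable for an impressed item but is penalized for both a purchase (too low) and a random negative (too high), which is how the three classes end up occupying separate score bands.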

If this is right

  • Semantic matching trained on behavior data can retrieve relevant products that lexical indexes miss due to synonyms or spelling variation.
  • The thresholded loss allows the model to learn graded relevance without treating all non-purchases as equal negatives.
  • Model-parallel training across eight GPUs makes the approach feasible for catalogs with millions of products.
  • Online A/B tests can directly measure lift in user engagement metrics when the model is deployed.
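The first point above can be sketched as nearest-neighbor retrieval, assuming queries and products are embedded in a shared vector space and compared by cosine similarity. That similarity choice, the function names, and the toy vectors are assumptions of this sketch; the text above does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, product_vecs, k):
    """Return the ids of the k products most similar to the query vector."""
    ranked = sorted(product_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [product_id for product_id, _ in ranked[:k]]

# Toy 2-d embeddings, invented for illustration.
PRODUCT_VECS = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.1], PRODUCT_VECS, k=2)
```

A production system would replace the full sort with an approximate nearest-neighbor index, but the ranking criterion is the same.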

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If impression position bias is not removed from the logs, the embeddings may over-weight popular products regardless of true semantic fit.
  • Hashing for out-of-vocabulary tokens lets the model handle new brand names or typos without expanding the vocabulary.
  • Average pooling over n-grams may be especially suited to the short, keyword-like queries typical in product search.
  • The same loss and pooling design could be tested on other catalog domains where click or purchase logs exist.
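The hashing and pooling ideas in the points above can be sketched as follows. This is a minimal illustration under stated assumptions: word-level n-grams up to length 2, CRC32 as the hash function, and a toy embedding table are choices made here, and all names are hypothetical.

```python
import zlib

def word_ngrams(tokens, max_n=2):
    """All contiguous word n-grams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

def embedding_row(gram, vocab, num_hash_buckets):
    """In-vocabulary grams keep their own row; unseen grams share hashed rows."""
    if gram in vocab:
        return vocab[gram]
    bucket = zlib.crc32(gram.encode("utf-8")) % num_hash_buckets
    return len(vocab) + bucket

def average_pool(rows, table):
    """Element-wise mean of the selected embedding rows."""
    dim = len(table[0])
    if not rows:
        return [0.0] * dim
    return [sum(table[r][d] for r in rows) / len(rows) for d in range(dim)]

# Toy vocabulary and table: 3 known grams plus 4 hash buckets, 2 dimensions.
VOCAB = {"red": 0, "shoes": 1, "red shoes": 2}
NUM_BUCKETS = 4
TABLE = [[float(r), float(r) * 2.0] for r in range(len(VOCAB) + NUM_BUCKETS)]

grams = word_ngrams("red shoees".split())          # query containing a typo
rows = [embedding_row(g, VOCAB, NUM_BUCKETS) for g in grams]
query_vec = average_pool(rows, TABLE)
```

The misspelled token and the bigram that contains it fall into hash buckets rather than being dropped, so the query still receives a fixed-size vector without the vocabulary growing.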

Load-bearing premise

Customer behavior logs, which distinguish purchases from impressions without purchase, supply an unbiased and sufficiently dense signal of semantic relatedness between queries and products.

What would settle it

Run the trained model on a held-out set of queries drawn from a product category absent from the training logs and measure whether Recall@100 falls below the lexical baseline.
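The comparison in this test comes down to computing Recall@K for each system on the same held-out queries. A minimal sketch, assuming each query has a ranked list of retrieved product ids and a set of known relevant ids (the function name and data layout are illustrative):

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)

# Toy example with k=2: one of three relevant items is retrieved.
example_recall = recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=2)
```

Averaging this value over the held-out queries for both the neural model and the lexical baseline gives the two numbers the test would compare.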

Figures

Figures reproduced from arXiv: 1907.00937 by Ankit Shingavi, Bing Yin, Choon Hui Teo, Hao Gu, Priyanka Nigam, Vihan Lakshman, Vijai Mohan, Weitian (Allen) Ding, Yiwei Song.

Figure 1: System architecture for augmenting product …
Figure 2: Illustration of neural network architecture used …
Figure 3: Score distribution histogram shows large overlap …
Figure 4: Score distribution shows clear separation between …
Figure 5: Aggregation of different tokenization methods il…
Figure 6: Training time with various embedding dimensions …
Original abstract

We study the problem of semantic matching in product search, that is, given a customer query, retrieve all semantically related products from the catalog. Pure lexical matching via an inverted index falls short in this respect due to several factors: a) lack of understanding of hypernyms, synonyms, and antonyms, b) fragility to morphological variants (e.g. "woman" vs. "women"), and c) sensitivity to spelling errors. To address these issues, we train a deep learning model for semantic matching using customer behavior data. Much of the recent work on large-scale semantic search using deep learning focuses on ranking for web search. In contrast, semantic matching for product search presents several novel challenges, which we elucidate in this paper. We address these challenges by a) developing a new loss function that has an inbuilt threshold to differentiate between random negative examples, impressed but not purchased examples, and positive examples (purchased items), b) using average pooling in conjunction with n-grams to capture short-range linguistic patterns, c) using hashing to handle out of vocabulary tokens, and d) using a model parallel training architecture to scale across 8 GPUs. We present compelling offline results that demonstrate at least 4.7% improvement in Recall@100 and 14.5% improvement in mean average precision (MAP) over baseline state-of-the-art semantic search methods using the same tokenization method. Moreover, we present results and discuss learnings from online A/B tests which demonstrate the efficacy of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to address semantic matching in product search by training a deep neural model on customer behavior logs, using a novel loss with an inbuilt threshold to separate purchased positives, impressed-but-not-purchased examples, and random negatives; it also employs average pooling over n-grams, hashing for OOV tokens, and model-parallel training across 8 GPUs. It reports at least 4.7% improvement in Recall@100 and 14.5% in MAP over prior semantic search baselines (using identical tokenization) in offline tests on held-out logs, plus positive results from online A/B tests.

Significance. If the performance gains prove robust, the work would be significant for e-commerce retrieval, as it directly tackles lexical matching failures on synonyms, morphology, and spelling while scaling to large catalogs via practical engineering choices. The behavioral-data-driven loss and online validation are pragmatic strengths that could influence production systems, though the magnitude of gains would need independent confirmation.

major comments (2)
  1. [Abstract] Abstract: the central claim of 4.7% Recall@100 and 14.5% MAP gains is presented without any description of the experimental protocol, baseline implementations, statistical tests, ablation studies, or dataset statistics, rendering the performance numbers unverifiable from the supplied text.
  2. [Abstract] Abstract (loss and data section implied): the loss treats purchases as positives, non-purchase impressions as an intermediate class, and random items as negatives via an inbuilt threshold, yet the manuscript supplies no inverse-propensity scoring, result randomization, or explicit controls for position bias and popularity effects; if these confounds dominate the labels, the reported offline and online lifts may reflect training-distribution artifacts rather than improved semantic matching.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'using the same tokenization method' for baselines is stated without defining or citing the tokenization procedure itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, providing clarifications from the full manuscript and indicating planned revisions where appropriate.

  1. Referee: [Abstract] Abstract: the central claim of 4.7% Recall@100 and 14.5% MAP gains is presented without any description of the experimental protocol, baseline implementations, statistical tests, ablation studies, or dataset statistics, rendering the performance numbers unverifiable from the supplied text.

    Authors: The abstract is a concise summary of contributions and results. Full details on the experimental protocol (including held-out log evaluation, identical tokenization for baselines, statistical tests, ablation studies on loss components and pooling, and dataset statistics such as query-product pair volumes) appear in the Experiments and Results sections of the manuscript. We will revise the abstract to include a one-sentence reference to the offline evaluation setup and online A/B validation for improved verifiability. revision: partial

  2. Referee: [Abstract] Abstract (loss and data section implied): the loss treats purchases as positives, non-purchase impressions as an intermediate class, and random items as negatives via an inbuilt threshold, yet the manuscript supplies no inverse-propensity scoring, result randomization, or explicit controls for position bias and popularity effects; if these confounds dominate the labels, the reported offline and online lifts may reflect training-distribution artifacts rather than improved semantic matching.

    Authors: The loss is explicitly designed around observed customer behavior (purchases as positives, non-purchase impressions as intermediate, random negatives for contrast) without inverse-propensity scoring. Position and popularity biases are inherent to logged data; however, the online A/B tests randomize model exposure in production and still show lifts, providing evidence that gains are not solely artifacts. We will add a paragraph in the Discussion section acknowledging these potential confounds and the role of randomized online validation in mitigating concerns. revision: yes
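For readers unfamiliar with the control the referee names, inverse-propensity scoring reweights each logged example by the inverse of the estimated probability that its display position was examined, so that items shown in rarely examined positions are not under-counted. The sketch below is a generic illustration of that idea, not something the paper or the rebuttal describes: the propensity table, the clipping value, and the function names are all hypothetical.

```python
def ips_weight(position, examination_propensity, max_weight=10.0):
    """Inverse-propensity weight for a logged position, clipped to limit variance."""
    propensity = examination_propensity[position]
    if propensity <= 0.0:
        raise ValueError("propensity must be positive")
    return min(1.0 / propensity, max_weight)

def ips_weighted_mean_loss(examples, examination_propensity):
    """Self-normalized weighted mean over (loss, position) pairs."""
    weighted_sum = 0.0
    weight_total = 0.0
    for loss, position in examples:
        weight = ips_weight(position, examination_propensity)
        weighted_sum += weight * loss
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0

# Hypothetical propensities: position 1 is always examined, position 2 half the time.
PROPENSITY = {1: 1.0, 2: 0.5}
mean_loss = ips_weighted_mean_loss([(1.0, 1), (0.0, 2)], PROPENSITY)
```

In practice the propensities themselves must be estimated, typically from randomized result orderings, which is the second control the referee lists.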

Circularity Check

0 steps flagged

No circularity in claimed results or method


The paper trains embeddings on customer behavior logs (purchases as positives, impressions as intermediate, random as negatives) and reports standard retrieval metrics (Recall@100, MAP) on held-out logs plus online A/B tests. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters of the same data; the evaluation uses external baselines and held-out test sets, keeping the central empirical claim independent of its training inputs.
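For reference, mean average precision in its standard form can be computed as below. This is a generic sketch rather than the paper's evaluation code; the function names and data layout are illustrative, and each ranked list is assumed to contain no duplicate ids.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, product_id in enumerate(ranked_ids, start=1):
        if product_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Average of per-query AP over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy runs: relevant items at ranks 1 and 3, then a single relevant item at rank 2.
RUNS = [(["a", "x", "b"], {"a", "b"}), (["x", "a"], {"a"})]
map_score = mean_average_precision(RUNS)
```

Because the relevance labels come from held-out logs while the model parameters come from training logs, the metric is computed on data the model did not fit, which is the independence the audit relies on.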

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that purchase behavior is a clean proxy for semantic similarity and on standard neural-network training assumptions; no new physical entities are postulated.

free parameters (1)
  • loss threshold
    Inbuilt threshold separating random, impressed, and purchased examples; its concrete value is not stated in the abstract.
axioms (1)
  • domain assumption: Customer purchase and impression logs provide an unbiased signal of query-product semantic relatedness
    Used to label positives and the two classes of negatives for the custom loss.



