arXiv: 1907.01549 · v2 · pith:QVBMCLWQnew · submitted 2019-07-01 · 💻 cs.IR · cs.CL · cs.LG · stat.ML

Learning to Rank Broad and Narrow Queries in E-Commerce

Pith reviewed 2026-05-25 11:23 UTC · model grok-4.3

classification 💻 cs.IR · cs.CL · cs.LG · stat.ML
keywords learning to rank · e-commerce search · query segmentation · broad queries · narrow queries · fashion search · LETOR

The pith

Specialized models for broad and narrow queries outperform a combined model in fashion search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a learning-to-rank framework for e-commerce product search. It first segments queries into broad and narrow categories according to inferred user intent. Separate ranking models are then trained on each segment and shown to deliver higher performance than one model trained on the full set of queries. The work also shows how denoising auto-encoders and word embeddings address sparsity in product and query features. These steps matter because different query types reflect distinct shopping goals that a single model struggles to serve equally well.

Core claim

The central claim is that, on fashion-category data, distinct pointwise and pairwise LETOR models trained on broad queries alone and on narrow queries alone outperform a single combined model trained on all queries. Query segmentation is performed by analyzing user intent, features are drawn from query, product, and query-product sources, and sparsity is mitigated with a denoising auto-encoder for product features plus skip-gram embeddings for query-product matching. Multiple target metrics are compared for robustness.

What carries the argument

A query-segmentation mechanism that divides queries into broad versus narrow categories on the basis of user intent, used to train separate pointwise and pairwise learning-to-rank models.
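
A minimal sketch of how such a split could route queries, assuming each query already carries a precomputed coherency score. The 0.58 cutoff is the value quoted in the paper text extracted in the reference graph on this page (score ≤ 0.58 is broad, score > 0.58 is narrow). The function names and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Cutoff quoted in the extracted paper text:
# coherency score <= 0.58 -> broad query, > 0.58 -> narrow query.
COHERENCY_THRESHOLD = 0.58


def segment_query(coherency_score: float,
                  threshold: float = COHERENCY_THRESHOLD) -> str:
    """Return the segment label for one query (hypothetical helper)."""
    return "broad" if coherency_score <= threshold else "narrow"


def split_queries(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group queries by segment so each group can feed its own ranking model."""
    segments: Dict[str, List[str]] = {"broad": [], "narrow": []}
    for query, score in scores.items():
        segments[segment_query(score)].append(query)
    return segments


# Made-up coherency scores, for illustration only.
example_scores = {"tshirts": 0.31, "nike dri-fit running tshirt": 0.84}
segments = split_queries(example_scores)
```

Each resulting group would then be used to train or serve its own model; how the coherency score itself is computed is left out here because this page does not reproduce that definition.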

If this is right

  • Feature importance patterns differ between broad-query and narrow-query models.
  • Target metrics can be evaluated for stability when ranking is split by query type.
  • Sparsity-handling techniques enable the use of otherwise unusable product and query features.
  • Pointwise and pairwise training both benefit from the segmentation step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The segmentation step could be applied to non-fashion verticals if intent signals remain consistent.
  • Real-time query classification would be required for the specialized models to be deployed at scale.
  • Conversion or revenue metrics might improve if the ranking objective is aligned with the same broad-narrow split.

Load-bearing premise

The proposed way of dividing queries into broad versus narrow categories correctly reflects user intent and remains stable across product categories and time periods.

What would settle it

Train a single combined model on the full fashion query set and show that its ranking quality on a held-out test set equals or exceeds the quality of the two specialized models.
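
One way to run that comparison is to score both setups with NDCG on the same held-out queries. The sketch below follows the DCG, IDCG and NDCG definitions in the extracted paper text (gain 2^rel − 1, discount log2(rank + 1), ranks starting at 1). The function names and the relevance lists are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Sequence


def dcg(relevances_in_rank_order: Sequence[float]) -> float:
    """DCG with gain 2^rel - 1 and discount log2(rank + 1), ranks starting at 1."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances_in_rank_order, start=1))


def ndcg(relevances_in_predicted_order: Sequence[float]) -> float:
    """NDCG = DCG of the predicted order / DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:
        return 0.0  # assumption: a query with no relevant products scores 0
    return dcg(relevances_in_predicted_order) / ideal


def mean_ndcg(per_query_relevances: Sequence[Sequence[float]]) -> float:
    """Average NDCG over held-out queries (0.0 for an empty set)."""
    if not per_query_relevances:
        return 0.0
    return sum(ndcg(r) for r in per_query_relevances) / len(per_query_relevances)


# Made-up relevance labels, listed in the order each setup ranked the products.
combined_model = [[1, 3, 2], [0, 1]]
specialized_models = [[3, 2, 1], [1, 0]]
combined_wins_or_ties = mean_ndcg(combined_model) >= mean_ndcg(specialized_models)
```

If `combined_wins_or_ties` held on real held-out data across the reported target metrics, the paper's central claim would not survive; the toy lists here only exercise the arithmetic.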

Figures

Figures reproduced from arXiv: 1907.01549 by Sagar Arora, Siddhartha Devapujula, Sumit Borar.

Figure 1. Query Distribution basis Coherency Scores
Figure 2. Coherency Score vs Recall Set Size
Figure 3. Architecture
Figure 4. Denoising Autoencoder Architecture
Figure 5. Comparison of different models
Figure 6. Broad Vs Narrow
Original abstract

Search is a prominent channel for discovering products on an e-commerce platform. Ranking products retrieved from search becomes crucial to address customer's need and optimize for business metrics. While learning to Rank (LETOR) models have been extensively studied and have demonstrated efficacy in the context of web search; it is a relatively new research area to be explored in the e-commerce. In this paper, we present a framework for building LETOR model for an e-commerce platform. We analyze user queries and propose a mechanism to segment queries between broad and narrow based on user's intent. We discuss different types of features - query, product and query-product and discuss challenges in using them. We show that sparsity in product features can be tackled through a denoising auto-encoder while skip-gram based word embeddings help solve the query-product sparsity issues. We also present various target metrics that can be employed for evaluating search results and compare their robustness. Further, we build and compare performances of both pointwise and pairwise LETOR models on fashion category data set. We also build and compare distinct models for broad and narrow queries, analyze feature importance across these and show that these specialized models perform better than a combined model in the fashion world.
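
The target metrics mentioned in the abstract are defined in the extracted paper text as simple ratios per query-product pair: click-through rate, add-to-cart rate, conversion, and revenue per impression. The sketch below follows the prose definitions given there. The extracted formulas reuse the symbol B for both the add-to-cart and the purchase numerator, so the separate `cart_adds` and `purchases` fields are an assumption, as are the class name, the field names, and the choice to return 0.0 when a denominator is zero.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class QueryProductCounts:
    """Aggregated counts for one (query, product) pair; field names are hypothetical."""
    impressions: int
    clicks: int
    cart_adds: int
    purchases: int
    revenue: float


def _ratio(numerator: float, denominator: float) -> float:
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def target_metrics(c: QueryProductCounts) -> Dict[str, float]:
    return {
        "ctr": _ratio(c.clicks, c.impressions),            # clicks per impression
        "atcr": _ratio(c.cart_adds, c.clicks),             # cart adds per click
        "conversion": _ratio(c.purchases, c.impressions),  # purchases per impression
        "rpi": _ratio(c.revenue, c.impressions),           # revenue per impression
    }


# Made-up counts, for illustration only.
metrics = target_metrics(QueryProductCounts(1000, 100, 20, 5, 250.0))
```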

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a LETOR framework for e-commerce search ranking. It proposes a mechanism to segment queries into broad versus narrow based on user intent, discusses query/product/query-product features and sparsity mitigation via denoising auto-encoders and skip-gram embeddings, compares target metrics, and evaluates pointwise and pairwise models on fashion data. The central empirical claim is that distinct models trained on broad and narrow queries outperform a single combined model.

Significance. If the segmentation is shown to be valid and the gains are robust, the result would offer a practical way to improve ranking quality in e-commerce by tailoring models to query breadth, with direct business relevance for fashion verticals. The sparsity-handling techniques are standard but appropriately applied; the comparison of evaluation metrics is a secondary contribution.

major comments (2)
  1. [Query segmentation mechanism (described in abstract and methods)] The headline result (specialized broad/narrow models outperform the combined model) is load-bearing on the claim that the proposed segmentation accurately captures user intent. The manuscript describes a mechanism but supplies no quantitative validation such as agreement with human labels, temporal stability, or cross-vertical consistency; without this, any reported lift could be an artifact of the partition rule rather than genuine intent differences.
  2. [Experimental results and evaluation (abstract and results sections)] The abstract states that comparative experiments were performed on fashion data and that specialized models perform better, yet reports no performance numbers, dataset sizes, error bars, or statistical tests. This absence prevents verification of effect sizes or robustness to metric selection and post-hoc choices.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence description of the segmentation heuristic and the magnitude of the observed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation and claims. We respond to each major comment below and commit to revisions where appropriate.

Point-by-point responses
  1. Referee: [Query segmentation mechanism (described in abstract and methods)] The headline result (specialized broad/narrow models outperform the combined model) is load-bearing on the claim that the proposed segmentation accurately captures user intent. The manuscript describes a mechanism but supplies no quantitative validation such as agreement with human labels, temporal stability, or cross-vertical consistency; without this, any reported lift could be an artifact of the partition rule rather than genuine intent differences.

    Authors: We agree that the segmentation's validity is central to the interpretation of results. The manuscript describes a rule-based mechanism using query characteristics to approximate intent differences, and the observed performance gains provide supporting evidence. However, we acknowledge the absence of direct quantitative validation. In the revised manuscript we will add an analysis of agreement between the segmentation and human labels on a sampled set of queries, along with checks for temporal stability. revision: yes

  2. Referee: [Experimental results and evaluation (abstract and results sections)] The abstract states that comparative experiments were performed on fashion data and that specialized models perform better, yet reports no performance numbers, dataset sizes, error bars, or statistical tests. This absence prevents verification of effect sizes or robustness to metric selection and post-hoc choices.

    Authors: We agree that the experimental reporting requires more detail for verifiability. While the results section contains model comparisons, specific numerical values, dataset sizes, error bars, and statistical tests are not presented with sufficient prominence. In the revision we will incorporate concrete performance numbers, dataset statistics, standard errors, and significance tests into both the abstract and results sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical model comparison on held-out data

Full rationale

The paper describes an empirical workflow: a segmentation heuristic for broad vs. narrow queries, feature engineering (including auto-encoders and embeddings), and training/comparison of pointwise and pairwise LETOR models on fashion data, with performance evaluated on held-out sets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim (specialized models outperform a combined model) is a direct empirical result rather than a reduction to inputs by construction. The segmentation step is an input assumption whose validity is external to the reported metrics, but this does not create definitional or self-referential circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that query intent can be reliably partitioned into broad and narrow categories and that standard sparsity-handling techniques transfer to product and query-product features.

axioms (1)
  • domain assumption: User queries can be meaningfully segmented into broad and narrow based on intent
    Central to the proposed framework and to the claim that specialized models outperform a combined one.

pith-pipeline@v0.9.0 · 5749 in / 1189 out tokens · 22043 ms · 2026-05-25T11:23:10.489182+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

  1. [1]

    Learning to Rank Broad and Narrow Queries in E-Commerce

    INTRODUCTION Users on an e-commerce platform typically discover products through search, browsing categories or marketing campaigns. On our platform, search functionality is key to product discovery as each of these channels translates to a search query in the back-end. Search ranking is a critical aspect of our business. Hence any improvement in th...

  2. [2]

    We show that segmenting queries and training different models for each can be a better approach than training single model across the board

    We provide an approach to segment our queries into broad and narrow basis how coherent the downstream sessions are. We show that segmenting queries and training different models for each can be a better approach than training single model across the board

  3. [3]

    Apart from using typical query features, product features and query-product features, we propose a denoising autoencoder based architecture to reduce sparsity of product features and skip gram based word embeddings for query-product features

  4. [4]

    Further, we study the behaviour of various target variables - CTR, Add to cart ratio, conversion and Revenue Per Impression

    We demonstrate the impact of various combinations of different types of features on the model’s performance. Further, we study the behaviour of various target variables - CTR, Add to cart ratio, conversion and Revenue Per Impression

  5. [5]

    We also show how our model significantly improves NDCG compared to the baseline model built upon style popularity

    We highlight the differences between broad and narrow queries in terms of modelling approach, feature importance etc. We also show how our model significantly improves NDCG compared to the baseline model built upon style popularity. (Footnote 1: The mean search engine ranking position for all the clicked products)

  6. [6]

    Various LETOR models like RankNet, LambdaMART, AdaRank and RankBoost have been compared on Web Search data [3, 5–7]

    RELATED WORK LETOR methods have demonstrated their success in web search [3, 4]. Various LETOR models like RankNet, LambdaMART, AdaRank and RankBoost have been compared on Web Search data [3, 5–7]. Moreover search has been extensively studied in the e-commerce primarily from the retrieval perspective [8–12]. Karmaker et al. [2] attempted to apply LETO...

  7. [7]

    We have divided the queries into train and test in the ratio of 70:30

    BROAD AND NARROW QUERIES We randomly sampled 100k queries and labelled them as Broad/Narrow. We have divided the queries into train and test in the ratio of 70:30. Later we have trained an SVM classifier with radial-basis kernel on the train queries. We have considered multiple sets of features - Word2Vec of query, result set size and identified query attr...

  8. [8]

    Clearly, it can be attributed to unnamed queries

    This corresponds to bin number 92, which further corresponds to a recall set size of 1910-2300. Clearly, it can be attributed to unnamed queries. So, a query with coherency score ≤ 0.58 is referred to as broad query while query with coherency score > 0.58 is referred to as narrow query. Table 1 shows various statistics regarding both the segments. It...

  9. [9]

    We use SOLR as the underlying search engine with over 3M fashion products indexed

    SYSTEM DESIGN This section discusses the architecture (figure 3) involved in retrieval and ranking of search products on our platform. We use SOLR as the underlying search engine with over 3M fashion products indexed. Figure 3: Architecture When a user issues a query, the retrieval layer renders a set of top-K (typically 1000) products based on BM-25 score...

  10. [10]

    Footnote 3: The primary focus of this paper would be LETOR, instead

    Catalogue Data: Structured information regarding each product’s physical features like brand, color, mrp etc. Footnote 3: The primary focus of this paper would be LETOR, instead

  11. [11]

    Transactional Data: Product’s output business metrics like daywise revenue, CTR etc

  12. [12]

    Query-Clickstream logs: Logs each query and information regarding query’s downstream sessions like products seen (impressions), clicked, added to cart, wishlisted, liked, purchased etc

  13. [13]

    This section focuses on various features and target variables we used for training our LETOR models

    MODEL Learning to rank is a popular approach that provides a principled way to optimize ranking of search results given various features. This section focuses on various features and target variables we used for training our LETOR models. From modelling perspective, we tried 2 pointwise models - Random Forests and Gradient Boosting Model and 2 pairwis...

  14. [14]

    Query Features: These are features specific to query like total length of query, number of words, is brand (eg Nike, Tommy Hilfiger) present in query, is article type (eg Dresses, Shoes) present in query, the identified article type, brand etc

  15. [15]

    Product Features: These are features specific to the products (documents). They can either be popularity related or physical features. The popularity features include features involving past performance of the product’s brand or article type (hereafter referred to as entity) like revenue in 15 days, quantity sold in 15 days etc. It is worth mentioning ...

  16. [16]

    batman printed tshirt

    Query Product Features The query product features are features which involve both query and product - eg ctr of a tshirt product when the query is “batman printed tshirt”. The query product features can again be of 2 different types: popularity based and relevance based. The popularity features include past performance of product’s entity as a result...

  17. [17]

    Better the ranking, higher would be the CTR: $\mathrm{CTR}_{qp} = \frac{C_{qp}}{I_{qp}}$ (5)

    Click-Through Rate It is the probability of clicking on the listpage (the page as a result of search query). Better the ranking, higher would be the CTR: $\mathrm{CTR}_{qp} = \frac{C_{qp}}{I_{qp}}$ (5)

  18. [18]

    It is the perceived utility of click page

    Add to cart Rate It is the probability of adding a product to the cart post the click. It is the perceived utility of click page. $\mathrm{ATCR}_{qp} = \frac{B_{qp}}{C_{qp}}$ (6)

  19. [19]

    This can be considered as the overall satisfaction of the user

    Conversion It is the probability of purchasing a product from listpage. This can be considered as the overall satisfaction of the user. $\mathrm{Conv}_{qp} = \frac{B_{qp}}{I_{qp}}$ (7)

  20. [20]

    $\mathrm{RPI}_{qp} = \frac{R_{qp}}{I_{qp}}$ (8)

    Revenue per Impression It refers to the overall business value (revenue) from each impression as a result of the query. $\mathrm{RPI}_{qp} = \frac{R_{qp}}{I_{qp}}$ (8)

  21. [21]

    We have classified them into broad and narrow using the model described in Section 3

    RESULTS 6.1 Dataset We randomly sampled 100k queries. We have classified them into broad and narrow using the model described in Section 3. This resulted in 13k broad queries and 87k narrow queries. We sampled queries in 80-20 proportion in a stratified manner to collate train (broad and narrow) and test (broad and narrow) queries. 6.1.1 Train and Test Data W...

  22. [22]

    Let $i_p$ be the predicted rank position for each product and $i_i$ be the ideal rank position

    products for the query and compute relevance scores for each query-product as above. Let $i_p$ be the predicted rank position for each product and $i_i$ be the ideal rank position. Now for each query q, compute $\mathrm{DCG} = \sum_p \frac{2^{rel_{qp}} - 1}{\log_2(i_p + 1)}$ (10), $\mathrm{IDCG} = \sum_p \frac{2^{rel_{qp}} - 1}{\log_2(i_i + 1)}$ (11), $\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$ (12). 6.2 Cross Target Learning In e-commerce, choosing one targ...

  23. [23]

    We proposed a notion of coherency score and used it to segment queries into broad and narrow

    CONCLUSIONS We presented a framework for building LETOR models for an e-commerce platform - specifically for the unnamed queries. We proposed a notion of coherency score and used it to segment queries into broad and narrow. We discussed the challenges involved in feature representation (query, product and query-product) and target metrics (ctr,atcr,conv...

  24. [24]

    Did We Get It Right? Predicting Query Performance in E-commerce Search

    R. Kumar, M. Kumar, N. Shah, and C. Faloutsos, “Did we get it right? predicting query performance in e-commerce search,” arXiv preprint arXiv:1808.00239, 2018

  25. [25]

    On application of learning to rank for e-commerce search

    S. K. Karmaker Santu, P. Sondhi, and C. Zhai, “On application of learning to rank for e-commerce search,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017, pp. 475–484

  26. [26]

    Yahoo! learning to rank challenge overview

    O. Chapelle and Y. Chang, “Yahoo! learning to rank challenge overview,” in Proceedings of the Learning to Rank Challenge, 2011, pp. 1–24

  27. [27]

    Advances in formal models of search and search behaviour

    L. Azzopardi and G. Zuccon, “Advances in formal models of search and search behaviour,” in Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. ACM, 2016, pp. 1–4

  28. [28]

    Learning to rank for information retrieval and natural language processing

    H. Li, “Learning to rank for information retrieval and natural language processing,” Synthesis Lectures on Human Language Technologies, vol. 7, no. 3, pp. 1–121, 2014

  29. [29]

    Learning to rank for information retrieval

    T.-Y. Liu et al., “Learning to rank for information retrieval,” Foundations and Trends® in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009

  30. [30]

    From ranknet to lambdarank to lambdamart: An overview

    C. J. Burges, “From ranknet to lambdarank to lambdamart: An overview.”

  31. [31]

    Diversifying search results

    R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, “Diversifying search results,” in Proceedings of the second ACM international conference on web search and data mining. ACM, 2009, pp. 5–14

  32. [32]

    Towards a theory model for product search

    B. Li, A. Ghose, and P. G. Ipeirotis, “Towards a theory model for product search,” in Proceedings of the 20th international conference on World wide web. ACM, 2011, pp. 327–336

  33. [33]

    Enhancing product search by best-selling prediction in e-commerce

    B. Long, J. Bian, A. Dong, and Y. Chang, “Enhancing product search by best-selling prediction in e-commerce,” in Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012, pp. 2479–2482

  34. [34]

    Learning latent vector spaces for product search

    C. Van Gysel, M. de Rijke, and E. Kanoulas, “Learning latent vector spaces for product search,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016, pp. 165–174

  35. [35]

    Latent dirichlet allocation based diversified retrieval for e-commerce search

    J. Yu, S. Mohan, D. P. Putthividhya, and W.-K. Wong, “Latent dirichlet allocation based diversified retrieval for e-commerce search,” in Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 2014, pp. 463–472

  36. [36]

    Narrow or broad?: Estimating subjective specificity in exploratory search

    K. Athukorala, A. Oulasvirta, D. Głowacka, J. Vreeken, and G. Jacucci, “Narrow or broad?: Estimating subjective specificity in exploratory search,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 2014, pp. 819–828

  37. [37]

    Query ambiguity identification based on user behavior information

    C. Luo, Y. Liu, M. Zhang, and S. Ma, “Query ambiguity identification based on user behavior information,” in Asia Information Retrieval Symposium. Springer, 2014, pp. 36–47

  38. [38]

    Decoding fashion contexts using word embeddings

    S. Arora and D. Warrier, “Decoding fashion contexts using word embeddings.”

  39. [39]

    Wordnet: a lexical database for english

    G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995