pith. sign in

arxiv: 1906.12120 · v1 · pith:Z6EFZWOZnew · submitted 2019-06-28 · 💻 cs.LG · cs.IR· stat.ML

One Embedding To Do Them All

Pith reviewed 2026-05-25 13:40 UTC · model grok-4.3

classification 💻 cs.LG cs.IRstat.ML
keywords product embeddingsmulti-source learninge-commercedenoising autoencoderBayesian personalized rankingSiamese networkunified representationsclickstream data
0
0 comments X

The pith

Unified embeddings from text, clicks and images perform well on attribute coverage, similarity and return prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework that learns one set of product embeddings by drawing on catalog text, user clickstream sessions and product images at the same time. Separate models are trained on each data type using denoising auto-encoders, Bayesian personalized ranking and a Siamese network, after which their embeddings are combined. The resulting unified embeddings are then tested on three unrelated real-world tasks: checking how well products are described by attributes, finding similar products and predicting returns. The authors report that the single embedding set delivers strong results across all three tasks. A reader would care because current practice usually trains and stores separate representations for each function, so a shared embedding could reduce duplication while maintaining performance.

Core claim

By training independent models on catalog text with denoising auto-encoders, on clickstream data with Bayesian personalized ranking and on images with a Siamese network, then forming an ensemble of the resulting embeddings, a unified product representation is obtained that performs uniformly well on product attribute coverage, similar-product retrieval and return prediction without further task-specific training.

What carries the argument

The ensemble that combines embeddings produced separately by a denoising auto-encoder on text, Bayesian personalized ranking on clickstream sessions and a Siamese network on images.

If this is right

  • A single embedding can be used for search, recommendation and operational tasks instead of maintaining separate models.
  • Training occurs once on the product catalog rather than once per downstream task.
  • Performance remains consistent even when the tasks share little overlap in their objectives.
  • Serving infrastructure simplifies because only one embedding table needs to be stored and queried.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same independent-model-plus-ensemble pattern could be applied to additional data types such as customer reviews or video if comparable source models exist.
  • Production systems that currently run multiple embedding services might reduce memory and lookup latency by switching to one unified table.
  • If new tasks are introduced later, the ensemble weights may need re-balancing, which could be tested by adding a fourth task and measuring whether uniform performance holds.

Load-bearing premise

The three source-specific models can be trained independently on the same catalog and then combined without the ensemble step introducing bias toward any one task or requiring per-task hyper-parameter search.

What would settle it

Running the same three tasks on a held-out catalog where a task-specific model trained only for return prediction clearly outperforms the unified embedding on that task alone.

Figures

Figures reproduced from arXiv: 1906.12120 by Loveperteek Singh, Sagar Arora, Shreya Singh, Sumit Borar.

Figure 1
Figure 1. Figure 1: Different Techniques to Learn Product Embeddings This section describes different ways to learn product em￾beddings. As shown in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Autoencoder Architecture [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Prod2Vec Architecture each product(in bag and purchased) as center word in the list we sample all other product in the list as context words. This is equivalent to generating all product-product(centre￾context) pairs from the list and setting window size to one. The latent representations of the products are learned using the Skip Gram with Negative Sampling model. We sam￾ple negative samples randomly from… view at source ↗
Figure 4
Figure 4. Figure 4: Most Similar Products to a given Query using different Embeddings [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Precision at different values of k for different attributes [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: T-SNE plot showing Brand Clusters Green, Lacoste and Tommy Hilfiger. Finally, a cluster in￾cluded a few brands (of slightly mass-premium price range) like Roadster, Here&Now and Moda Rapido. This clearly shows that embeddings are able to capture brand semantics fairly well so as to be able to capture user perception of brands. 4.3 Embeddings to Attributes This task attempts to evaluate learnt embeddings on… view at source ↗
Figure 9
Figure 9. Figure 9: Hit Ratio at different K values 4.6 Cart Return Prediction Cart return prediction is unrelated downstream tasks with which evaluated our embeddings. In this task, we aim to predict users’ propensity for returning product(s) from a cart at the time of purchase. Returns ensue bad user ex￾perience apart from extra operational costs incurred on the platform. As per our analysis, a product which is added to the… view at source ↗
read the original abstract

Online shopping caters to the needs of millions of users daily. Search, recommendations, personalization have become essential building blocks for serving customer needs. Efficacy of such systems is dependent on a thorough understanding of products and their representation. Multiple information sources and data types provide a complete picture of the product on the platform. While each of these tasks shares some common characteristics, typically product embeddings are trained and used in isolation. In this paper, we propose a framework to combine multiple data sources and learn unified embeddings for products on our e-commerce platform. Our product embeddings are built from three types of data sources - catalog text data, a user's clickstream session data and product images. We use various techniques like denoising auto-encoders for text, Bayesian personalized ranking (BPR) for clickstream data, Siamese neural network architecture for image data and combined ensemble over the above methods for unified embeddings. Further, we compare and analyze the performance of these embeddings across three unrelated real-world e-commerce tasks specifically checking product attribute coverage, finding similar products and predicting returns. We show that unified product embeddings perform uniformly well across all these tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a framework for learning unified product embeddings on an e-commerce platform by training three independent models—denoising auto-encoders on catalog text, Bayesian personalized ranking on user clickstream sessions, and Siamese networks on product images—then combining them via an ensemble. It evaluates these embeddings on three tasks (product attribute coverage, similar-product retrieval, and return prediction) and claims that the unified embeddings 'perform uniformly well across all these tasks' without task-specific adaptation.

Significance. If the uniformity claim were supported by rigorous, held-out evaluations with fixed ensemble parameters, the work would offer a practical demonstration that multi-modal product representations can reduce the need for per-task embedding training in e-commerce systems. The use of standard techniques (DAE, BPR, Siamese) on real catalog data is a reasonable starting point, but the manuscript supplies no quantitative evidence, baselines, or protocol details to substantiate the central claim.

major comments (3)
  1. [Abstract] Abstract: the claim that unified embeddings 'perform uniformly well across all these tasks' is unsupported; the text supplies no quantitative tables, baselines, statistical tests, held-out evaluation protocol, or dataset sizes, making it impossible to assess the uniformity result.
  2. [Methods] Methods (ensemble description): the 'combined ensemble' step is described only at the level of 'combined ensemble over the above methods' with no specification of the fusion operation (concatenation, weighted sum, learned projection, etc.) or whether any meta-parameters or weights are held fixed across the three downstream tasks; this directly undermines the task-agnostic claim.
  3. [Evaluation] Evaluation protocol: embeddings are learned from the same clickstream and catalog data later used to measure attribute coverage and return prediction, with no explicit statement of disjoint train/test splits or external benchmarks; the reported gains are therefore consistent with in-sample fitting rather than generalization.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'three unrelated real-world e-commerce tasks' would benefit from a brief parenthetical listing of the tasks for immediate clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate clarifications and additional details where needed.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that unified embeddings 'perform uniformly well across all these tasks' is unsupported; the text supplies no quantitative tables, baselines, statistical tests, held-out evaluation protocol, or dataset sizes, making it impossible to assess the uniformity result.

    Authors: We agree the abstract claim would benefit from supporting quantitative context. The full manuscript presents per-task results in the evaluation section, but to strengthen the presentation we will revise the abstract to include a concise summary of key metrics (e.g., relative improvements on attribute coverage, retrieval, and return prediction) along with dataset sizes and a reference to the held-out protocol described in Section 4. revision: yes

  2. Referee: [Methods] Methods (ensemble description): the 'combined ensemble' step is described only at the level of 'combined ensemble over the above methods' with no specification of the fusion operation (concatenation, weighted sum, learned projection, etc.) or whether any meta-parameters or weights are held fixed across the three downstream tasks; this directly undermines the task-agnostic claim.

    Authors: We will expand the methods section to specify the fusion: the three modality-specific embeddings are concatenated and passed through a single linear projection layer whose weights are learned once on a validation split and then frozen for all downstream tasks. This fixed-parameter design directly supports the task-agnostic claim; the revised text will include the exact fusion equation and confirmation that no task-specific re-tuning occurs. revision: yes

  3. Referee: [Evaluation] Evaluation protocol: embeddings are learned from the same clickstream and catalog data later used to measure attribute coverage and return prediction, with no explicit statement of disjoint train/test splits or external benchmarks; the reported gains are therefore consistent with in-sample fitting rather than generalization.

    Authors: We will add an explicit evaluation-protocol subsection clarifying the temporal and product-level splits used: embeddings are trained on data up to a cutoff date, attribute-coverage and retrieval evaluations use held-out products, and return prediction uses future sessions after the cutoff. We will also state the sizes of the disjoint sets and note any external benchmarks. These details were present in our internal protocol but omitted from the manuscript; the revision will make them explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical pipeline: independent training of DAE on text, BPR on clickstream, and Siamese on images, followed by an ensemble whose fusion method is unspecified, then evaluation on attribute coverage, similar-product retrieval, and return prediction. No equations, uniqueness theorems, or derivation steps are presented in the abstract or described text that reduce a claimed result to its inputs by construction. No self-citation load-bearing premises or ansatz smuggling appear. The central claim is therefore an empirical observation rather than a closed-form derivation, making circularity analysis inapplicable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the unstated premise that the three data modalities are complementary and that standard off-the-shelf losses (reconstruction, ranking, contrastive) can be combined without new theoretical justification; no free parameters are explicitly introduced beyond those internal to the cited algorithms.

axioms (1)
  • domain assumption Product representations learned from one modality transfer to tasks defined on other modalities without additional alignment loss.
    Invoked when the ensemble is claimed to work uniformly on attribute, similarity, and return tasks.

pith-pipeline@v0.9.0 · 5730 in / 1197 out tokens · 40717 ms · 2026-05-25T13:40:00.326644+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 5 internal anchors

  1. [1]

    Matching consumer’s need and retrieving relevant products is pivotal to the business

    INTRODUCTION E-commerce is growing at a phenomenal rate around the world. Matching consumer’s need and retrieving relevant products is pivotal to the business. This has led to a lot of research in areas of search, recommendation systems, per- sonalization, demand prediction etc. For all these tasks, de- tailed understanding of product and users become ext...

  2. [2]

    Product titles are structured and the average length of product title is 7.3 words

    Textual Data: This involves products’ title (name), description and cataloged attributes like brand, color, fabric and physical attributes like neck, pattern etc. Product titles are structured and the average length of product title is 7.3 words. Product descriptions vary a lot based on the products and contain both structured and unstructured information...

  3. [3]

    These signals are good indicators for visibility and popularity of products on the platform

    Clickstream Data: This includes all the users’ ses- sions and the involved interactions including searches, impressions, clicks, sorts and, filters used, add to carts, purchases etc. These signals are good indicators for visibility and popularity of products on the platform

  4. [4]

    One Embedding To Do Them All

    Visual Data: This includes product images available in the catalog. Each product on an average is repre- sented by at least 4 images. These images are mostly shot in a controlled setting with solid color background and model poses. Our work focuses on capturing a wider variety of signals from various data sources (as mentioned above) to embed all products...

  5. [5]

    Embedding to Attribute : This task attempts to evaluate learned embeddings on how well they can cap- ture the products’ textual attributes like brand, color etc

  6. [6]

    We show how our unified embeddings are able to better capture the sim- ilarity

    Clicked-Purchased Product Similarity: we com- pute the similarity of the purchased product in a ses- sion with those which were clicked. We show how our unified embeddings are able to better capture the sim- ilarity

  7. [7]

    Hence, through cart return predic- tion, we aim to identify the cart products which have a high probability of being returned and take corrective actions

    Cart Return Prediction : Returns ensue bad user experience apart from extra operational costs incurred by our platform. Hence, through cart return predic- tion, we aim to identify the cart products which have a high probability of being returned and take corrective actions. This task involves using product embeddings to predict if a user u would return a ...

  8. [8]

    For implicit feedback setting, in- terpreting unobserved feedback poses a challenge

    RELATED WORK Traditionally, product representations have been learned through Matrix Factorization and related approaches [9, 16] which use only user’s feedback. For implicit feedback setting, in- terpreting unobserved feedback poses a challenge. [9] in- terprets unobserved feedback to be negative thereby asso- ciating weights with feedback and factorize ...

  9. [9]

    As shown in Figure 1 we evaluate embeddings learned from different data sources-

    METHODOLOGY Figure 1: Different Techniques to Learn Product Embeddings This section describes different ways to learn product em- beddings. As shown in Figure 1 we evaluate embeddings learned from different data sources-

  10. [10]

    Clickstream Data: BPR-MF, Prod2Vec and DeepWalk- Prod2Vec

  11. [11]

    Content Data (Catalogue and Image): Denoising Au- toencoder and Image Embeddings

  12. [12]

    Table 1 describes the terminology used

    Clickstream and Content Data: ProdSI2Vec (ProductSide- Information2Vec), DeepWalk-ProdSI2Vec and Unified Embeddings In addition to using user’s lifetime data, we also compare the performance of Prod2Vec and Prod-SI2Vec with graph based embeddings learned from a platform level item-item graph. Table 1 describes the terminology used. Symbol Meaning U the set...

  13. [13]

    Brand:Nike, Puma, Adidas,

  14. [14]

    BaseColor: Black, Red, Blue, Green,

  15. [15]

    Fabric: Cotton, Polyester, Blended,

  16. [16]

    Priceband: 0-500, 500-1000, 1000-1500, ...., 3000+

  17. [17]

    Neck: Round Neck, Polo Collar, V-neck,

  18. [18]

    In this approach, alongwith the product-product pairs we also generate product-SI pairs and SI-SI pairs to be input to the Word2Vec model

    Pattern: Printed, Solid, Striped, Colorblocked, .... In this approach, alongwith the product-product pairs we also generate product-SI pairs and SI-SI pairs to be input to the Word2Vec model. For each (centre-product, context- product) pair, we generate the following tuples:

  19. [19]

    (Pcentre,PSIcentre), for each SI of the centre product

  20. [20]

    (Pcentre,PSIcontext), for each SI of the context product

  21. [21]

    Thus we also learn vectors for each of those key-value pair from SI

    (PSIcentre,PSIcontext), for each (SI,SI) pair from centre and context products By doing so we have increased vocabulary size from total number of products to total number products plus the total number of SI key-value pairs. Thus we also learn vectors for each of those key-value pair from SI. 3.4.3 DeepWalk-Prod2V ec and DeepWalk-ProdSI2V ec DeepWalk was ...

  22. [22]

    Unifying Embeddings from ProdSI2Vec and Images

  23. [23]

    The weights are learned us- ing grid search on the cross-validation dataset of the down- stream task we use the embeddings for

    Unifying Embeddings from DeepWalk-ProdSI2Vec and Images We propose a simple weighted average to unify these em- beddings: γp =wI·γpI +wPSV ·γpP SV (9) whereγpI are image embeddings and wI is the weight asso- ciated with them, γpP SV are Word2Vec based embeddings (ProdSI2Vec or DeepWalk-ProdSI2Vec) and wPSV is the weight associated with them. The weights a...

  24. [24]

    The generalizability of embeddings implies that they be able to capture all the signals which effect tastes of a user

    RESULTS We evaluate the performance of all the nine embeddings on three different tasks, which chosen to be varied enough so as to be able to check the generalizability of embeddings. The generalizability of embeddings implies that they be able to capture all the signals which effect tastes of a user. Table 2 shows nine types of product embeddings which are...

  25. [25]

    CONCLUSION We propose a framework to combine multiple data sources - catalog text data, user’s clickstream session data, and product images and generate a unified representation of all products in a product semantic space . We utilized various state-of-art techniques like denoising auto-encoders for text, Bayesian personalized ranking (BPR) for clickstream...

  26. [26]

    Personalizing Similar Product Recommendations in Fashion E-commerce

    Agarwal, P., Vempati, S., and Borar, S. Person- alizing similar product recommendations in fashion e- commerce. arXiv preprint arXiv:1806.11371 (2018)

  27. [27]

    Deciphering fashion sensibility using community de- tection

    Arora, S., Madvariya, A., Alok, D., and Borar, S. Deciphering fashion sensibility using community de- tection. KDDW on ML meets fashion (2017)

  28. [28]

    Decoding fashion con- texts using word embeddings

    Arora, S., and W arrier, D. Decoding fashion con- texts using word embeddings. In KDD Workshop on Machine learning meets fashion (2016)

  29. [29]

    Real-time personaliza- tion using embeddings for search ranking at airbnb

    Grbovic, M., and Cheng, H. Real-time personaliza- tion using embeddings for search ranking at airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), ACM, pp. 311–320

  30. [30]

    E-commerce in your inbox: Product recom- mendations at scale

    Grbovic, M., Radosavljevic, V., Djuric, N., Bhamidipati, N., Savla, J., Bhagwan, V., and Sharp, D. E-commerce in your inbox: Product recom- mendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining (2015), ACM, pp. 1809–1818

  31. [31]

    node2vec: Scalable feature learning for networks

    Grover, A., and Leskovec, J. node2vec: Scalable feature learning for networks. InProceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (2016), ACM, pp. 855–864

  32. [32]

    Ups and downs: Model- ing the visual evolution of fashion trends with one-class collaborative filtering

    He, R., and McAuley, J. Ups and downs: Model- ing the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th inter- national conference on world wide web (2016), Interna- tional World Wide Web Conferences Steering Commit- tee, pp. 507–517

  33. [33]

    Vbpr: Visual bayesian personalized ranking from implicit feedback

    He, R., and McAuley, J. Vbpr: Visual bayesian personalized ranking from implicit feedback. In AAAI (2016), pp. 144–150

  34. [34]

    Collaborative filtering for implicit feedback datasets

    Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In Data Mining,

  35. [35]

    Eighth IEEE International Conference on (2008), Ieee, pp

    ICDM’08. Eighth IEEE International Conference on (2008), Ieee, pp. 263–272

  36. [36]

    Visually-aware fashion recommendation and design with generative image models

    Kang, W.-C., F ang, C., W ang, Z., and McAuley, J. Visually-aware fashion recommendation and design with generative image models. InData Mining (ICDM), 2017 IEEE International Conference on (2017), IEEE, pp. 207–216

  37. [37]

    Efficient Large-Scale Multi-Modal Classification

    Kiela, D., Grave, E., Joulin, A., and Mikolov, T. Efficient large-scale multi-modal classification.arXiv preprint arXiv:1802.02892 (2018)

  38. [38]

    Neural word embedding as implicit matrix factorization

    Levy, O., and Goldberg, Y. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems (2014), pp. 2177–2185

  39. [39]

    Efficient Estimation of Word Representations in Vector Space

    Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  40. [40]

    Specializing Joint Representations for the task of Product Recommendation

    Nedelec, T., Smirnova, E., and V asile, F. Spe- cializing joint representations for the task of prod- uct recommendation. arXiv preprint arXiv:1706.07625 (2017)

  41. [41]

    Deepwalk: Online learning of social representations

    Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (2014), ACM, pp. 701–710

  42. [42]

    Bpr: Bayesian personalized rank- ing from implicit feedback

    Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. Bpr: Bayesian personalized rank- ing from implicit feedback. InProceedings of the twenty- fifth conference on uncertainty in artificial intelligence (2009), AUAI Press, pp. 452–461

  43. [43]

    The distributional hypothesis

    Sahlgren, M. The distributional hypothesis. Italian Journal of Disability Studies 20 (2008), 33–53

  44. [44]

    Line: Large-scale information net- work embedding

    Tang, J., Qu, M., W ang, M., Zhang, M., Yan, J., and Mei, Q. Line: Large-scale information net- work embedding. In Proceedings of the 24th Interna- tional Conference on World Wide Web (2015), Inter- national World Wide Web Conferences Steering Com- mittee, pp. 1067–1077

  45. [45]

    Meta- prod2vec: Product embeddings using side-information for recommendation

    V asile, F., Smirnova, E., and Conneau, A. Meta- prod2vec: Product embeddings using side-information for recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 225–232

  46. [46]

    Extracting and composing robust features with denoising autoencoders

    Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning (2008), ACM, pp. 1096–1103

  47. [47]

    Learn- ing fine-grained image similarity with deep ranking

    W ang, J., Song, Y., Leung, T., Rosenberg, C., W ang, J., Philbin, J., Chen, B., and Wu, Y. Learn- ing fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vi- sion and Pattern Recognition (2014), pp. 1386–1393