Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com

Amritpal Singh Gill; Eva Agapaki

arxiv: 2605.00353 · v1 · submitted 2026-05-01 · 💻 cs.IR

Pith reviewed 2026-05-09 19:14 UTC · model grok-4.3

classification 💻 cs.IR

keywords negative sampling · contrastive learning · dense retrieval · product search · LLM evaluation · e-commerce · hard negatives · taxonomy

The pith

Taxonomy-based negative sampling and LLM judging raise offline average category accuracy for IKEA product search by 2.6 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to generate better negative examples for training dense retrieval models in e-commerce product search by using the hierarchical product taxonomy and product attributes. It also employs an LLM to score the relevance of every candidate product for each query, creating scalable training data without relying solely on human labels. Offline experiments on real user queries from the Canada market show a 2.6 percent increase in average category accuracy. Live A/B testing on long-tail queries, however, detects no statistically significant change in user engagement, which the authors link to zero-click behavior: 67 percent of popular searches have zero-click rates above 50 percent. This highlights that advances in retrieval training must be validated against actual user behavior patterns to affect production outcomes.

Core claim

By mining structured negatives from the product taxonomy and using LLM-as-a-judge to assign relevance scores, contrastive training of late-interaction dense retrievers achieves higher offline category accuracy, yet this does not produce measurable gains in online engagement metrics because a large fraction of user searches exhibit zero-click behavior independent of result quality.
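To make the training objective in this claim concrete, here is a minimal sketch of late-interaction scoring combined with a contrastive (InfoNCE-style) loss. It assumes ColBERT-style MaxSim scoring and a softmax cross-entropy over one positive and its negatives; the paper summary does not state the exact loss or scoring function used at IKEA, and the function names and toy embeddings are invented for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: each query token picks its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def info_nce_loss(pos_score: float, neg_scores: List[float],
                  temperature: float = 1.0) -> float:
    """Cross-entropy of the positive against the positive plus negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Toy 2-d token embeddings, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
positive = [[0.9, 0.1], [0.1, 0.9]]
easy_negative = [[-1.0, 0.0], [0.0, -1.0]]
hard_negative = [[0.8, 0.0], [0.0, 0.7]]

pos = maxsim_score(query, positive)
loss_easy = info_nce_loss(pos, [maxsim_score(query, easy_negative)])
loss_hard = info_nce_loss(pos, [maxsim_score(query, hard_negative)])
```

In this toy setup the hard negative scores close to the positive, so it produces a larger loss than the easy negative. That larger training signal is the usual motivation for mining hard negatives.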

What carries the argument

Structured negative sampling that draws on product hierarchical taxonomy and attributes to produce semantically challenging negatives, together with an LLM-based system that evaluates relevance for training data generation.
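One way such taxonomy- and attribute-based negative mining could look is sketched below. This is a hedged illustration, not the authors' procedure: the `Product` record, the two helper functions, and the toy catalog are all hypothetical, and the paper summary does not specify how negatives are actually selected.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Product:
    product_id: str
    category_path: Tuple[str, ...]  # root -> leaf, hypothetical taxonomy
    attributes: Dict[str, str]


def sibling_negatives(positive: Product, catalog: List[Product]) -> List[Product]:
    """Products under the same parent category but in a different leaf."""
    parent = positive.category_path[:-1]
    return [p for p in catalog
            if p.category_path[:-1] == parent
            and p.category_path != positive.category_path]


def attribute_mismatch_negatives(positive: Product,
                                 catalog: List[Product]) -> List[Product]:
    """Same leaf category, but at least one shared attribute has another value."""
    result = []
    for p in catalog:
        if p.product_id == positive.product_id:
            continue
        if p.category_path != positive.category_path:
            continue
        if any(p.attributes.get(k) not in (None, v)
               for k, v in positive.attributes.items()):
            result.append(p)
    return result


def sample_hard_negatives(positive: Product, catalog: List[Product],
                          k: int, seed: int = 0) -> List[Product]:
    """Sample up to k negatives from the two structured pools."""
    pool = (sibling_negatives(positive, catalog)
            + attribute_mismatch_negatives(positive, catalog))
    return random.Random(seed).sample(pool, min(k, len(pool)))


# Toy catalog, invented for illustration.
catalog = [
    Product("p1", ("Furniture", "Sofas", "2-seat sofas"), {"color": "grey"}),
    Product("p2", ("Furniture", "Sofas", "3-seat sofas"), {"color": "grey"}),
    Product("p3", ("Furniture", "Sofas", "2-seat sofas"), {"color": "blue"}),
    Product("p4", ("Lighting", "Lamps", "Floor lamps"), {"color": "grey"}),
    Product("p5", ("Furniture", "Sofas", "2-seat sofas"), {"color": "grey"}),
]
negatives = sample_hard_negatives(catalog[0], catalog, k=5)
```

Note that an attribute-mismatch candidate may still be relevant to a loosely worded query, which is one reason a separate relevance score for each candidate is useful before treating it as a negative.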

If this is right

Taxonomy-derived negatives yield more effective contrastive training than random sampling, improving offline retrieval quality.
LLM-based relevance evaluation scales the creation of labeled data for dense retrieval models.
Offline gains in category accuracy do not guarantee online engagement improvements when zero-click rates are high.
Retrieval systems benefit from incorporating real user search behavior, such as intent distributions and zero-click patterns, into evaluation and training.
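The zero-click statistic referenced above has the form "share of queries whose zero-click rate exceeds a threshold". A minimal sketch of that computation follows, assuming a hypothetical session log of `(query, clicks_in_session)` pairs; the paper's actual logging schema and its definition of a popular search are not given here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def zero_click_rates(sessions: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Per-query fraction of search sessions that ended with no product click."""
    total: Dict[str, int] = defaultdict(int)
    zero: Dict[str, int] = defaultdict(int)
    for query, clicks in sessions:
        total[query] += 1
        if clicks == 0:
            zero[query] += 1
    return {q: zero[q] / total[q] for q in total}


def share_above(rates: Dict[str, float], threshold: float = 0.5) -> float:
    """Fraction of queries whose zero-click rate is strictly above threshold."""
    if not rates:
        return 0.0
    return sum(r > threshold for r in rates.values()) / len(rates)


# Toy log, invented for illustration.
log = [
    ("sofa", 0), ("sofa", 0), ("sofa", 1),
    ("desk", 1), ("desk", 1), ("desk", 0), ("desk", 2),
    ("opening hours", 0), ("opening hours", 0),
]
rates = zero_click_rates(log)
share = share_above(rates, threshold=0.5)
```

Running the same computation separately on popular and long-tail query segments would show whether the two populations actually share the same zero-click profile.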

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar taxonomy-guided negative mining could extend to other e-commerce platforms with hierarchical product catalogs.
Addressing zero-click queries may require techniques beyond ranking, such as query suggestion or result diversification.
The findings point to a general need in information retrieval to align offline metrics more closely with downstream user actions.

Load-bearing premise

The LLM-as-a-judge generates reliable relevance scores for use as training labels, and negatives derived from the taxonomy are meaningfully harder and more effective than random sampling.

What would settle it

Running a controlled online experiment that directly compares models trained with taxonomy negatives versus random negatives while measuring engagement metrics, or having human annotators independently rate the same pairs to check LLM score accuracy.
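The second check, comparing LLM labels against independent human ratings, is commonly summarized with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa on binary relevance labels. The choice of kappa and the toy labels are assumptions made for illustration; the paper summary does not say which agreement measure, if any, the authors used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (1 = relevant, 0 = not relevant), invented for illustration.
human = [1, 1, 0, 0, 1, 0, 1, 0]
llm = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(human, llm)
```

Here the raters agree on 6 of 8 items, but half of that agreement is expected by chance, so kappa comes out at 0.5 rather than 0.75.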

Figures

Figures reproduced from arXiv: 2605.00353 by Amritpal Singh Gill, Eva Agapaki.

**Figure 1.** (a) Training data generation pipeline consisting of …

Original abstract

Contrastive learning is a core component of modern retrieval systems, but its effectiveness heavily relies on the quality of negative examples used during training. In this work, we present a systematic approach to improving dense retrieval for IKEA product search through structured negative sampling strategies and scalable LLM-as-a-judge relevance evaluation. Building on IKEA Search Engine's late-interaction retrieval architectures, we introduce two key contributions: (1) structured negative sampling strategies that leverage product hierarchical taxonomy and product attributes to generate semantically challenging negatives, and (2) a comprehensive LLM-based evaluation methodology for generating training data. Rather than relying on sparse human annotations or random sampling, our LLM-based evaluation system allocates a score for all candidate products against each query. Our methodology achieves +2.6% average category accuracy on offline real user query experiments on the Canada market. However, our A/B test on long-tail queries showed no statistically significant differences in user engagement metrics between the improved and baseline models (p > 0.05). We trace this gap to user search behavior: 67% of popular searches exhibit zero-click rates above 50%, indicating that a substantial proportion of search sessions result in no product engagement regardless of result ranking. These findings underscore the importance of hard negative mining but also the need for grounding training data and offline evals in real user search behavior, including query intent distribution and zero-click patterns, to bridge the gap between offline retrieval quality and online user engagement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The IKEA paper shows a modest offline gain from taxonomy negatives but the null A/B on long-tail queries isn't convincingly tied to popular-search zero-click stats.

The main takeaway is that the authors report a +2.6% lift in offline category accuracy after using product taxonomy and attributes for negative sampling, plus LLM scoring to build training data. Their A/B test on long-tail queries, however, showed no significant change in engagement metrics (p > 0.05), and they link this to high zero-click rates in searches. They are straightforward about both the gain and the null result, which is useful for anyone running retrieval in e-commerce catalogs where offline metrics often fail to predict live behavior. The work applies established ideas (taxonomy-derived hard negatives and LLM-as-judge evaluation) to IKEA's data and late-interaction models, with supporting user behavior numbers. That honesty about the offline-online gap is the clearest value here.

The soft spot is the explanation for the null A/B. The 67% zero-click figure is drawn from popular searches, but the test used long-tail queries, where users typically have more specific intent and might respond to better ranking. The mismatch means the zero-click pattern may not account for the lack of lift; it could instead reflect a small effect size or metrics that don't pick up the change. The abstract also leaves out concrete details on how negatives were built and how the LLM was prompted, which makes the training data quality hard to assess.

This is the sort of paper that helps practitioners see what happens when standard negative mining techniques meet a real retail catalog and live traffic. It does not introduce new methods or frameworks. I would send it to peer review because the empirical results and the attempt to connect offline gains to user behavior patterns give referees something concrete to evaluate, even if the interpretation needs work.

Referee Report

2 major / 0 minor

Summary. The paper proposes structured negative sampling using product taxonomy and attributes, combined with an LLM-as-a-judge system for relevance scoring, to improve contrastive learning for dense retrieval in IKEA's product search engine. It reports a +2.6% gain in average category accuracy on offline experiments with real user queries from the Canada market, but finds no statistically significant lift in user engagement metrics during an A/B test on long-tail queries (p > 0.05), which the authors attribute to high zero-click rates in real user behavior (67% of popular searches have zero-click rates above 50%).
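The summary does not define "average category accuracy". One plausible reading, offered here purely as an assumption, is a macro average: compute top-1 category accuracy within each expected category, then average across categories. The sketch below implements that reading with invented data; the paper's actual metric may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_category_accuracy(
    examples: Iterable[Tuple[str, str]],
) -> Tuple[float, Dict[str, float]]:
    """examples holds one (expected_category, predicted_category) pair per query.

    Returns the macro average over categories and the per-category accuracies.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for expected, predicted in examples:
        totals[expected] += 1
        hits[expected] += int(expected == predicted)
    per_category = {c: hits[c] / totals[c] for c in totals}
    if not per_category:
        return 0.0, per_category
    return sum(per_category.values()) / len(per_category), per_category


# Toy evaluation set, invented for illustration.
examples = [
    ("sofas", "sofas"), ("sofas", "armchairs"),
    ("lamps", "lamps"), ("lamps", "lamps"),
    ("desks", "desks"),
]
macro_accuracy, per_category = average_category_accuracy(examples)
```

A macro average weights every category equally, so a gain concentrated in a few small categories can move it more than the same number of corrected queries in a large category would.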

Significance. If the central claims hold after clarification, the work provides practical value for e-commerce retrieval by demonstrating domain-specific hard-negative mining and the necessity of aligning offline metrics with observed user patterns such as zero-click behavior. The inclusion of live A/B testing alongside offline category accuracy offers a grounded empirical assessment that could inform similar industrial systems, though the offline-online discrepancy highlights broader challenges in translating retrieval gains to engagement.

major comments (2)

[Abstract] The attribution of the null A/B result on long-tail queries to the statistic that '67% of popular searches exhibit zero-click rates above 50%' does not directly support the synthesis, because the cited zero-click pattern applies to popular searches while the A/B test explicitly uses long-tail queries; long-tail queries may exhibit different intent distributions and lower zero-click rates, leaving open alternative explanations such as insufficient effect size or metric choice.
[Abstract and methodology description] The central +2.6% category accuracy claim depends on the LLM-as-a-judge producing reliable relevance scores and on taxonomy-derived negatives being meaningfully harder than random sampling, yet the manuscript provides insufficient detail on the exact prompting strategy, scoring rubric, and negative construction procedure, preventing verification that these choices are load-bearing for the reported improvement.
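The "insufficient effect size" alternative in the first comment can be illustrated with a two-proportion z-test on click-through rates. This is a generic sketch: the paper summary does not say which engagement metrics or which statistical test the authors used, and the counts below are invented.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), using the pooled standard error under the null.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Invented counts: sessions with at least one click, control vs. treatment.
z_small, p_small = two_proportion_z_test(4000, 10000, 4050, 10000)
z_large, p_large = two_proportion_z_test(4000, 10000, 4400, 10000)
```

With these toy numbers, a 0.5-point lift on 10,000 sessions per arm stays above p = 0.05 while a 4-point lift does not. A small true effect can therefore produce a null result without any contribution from zero-click behavior.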

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important issues of precision in the abstract and the need for greater methodological transparency. We have revised the manuscript to address both points directly.

Referee: [Abstract] The attribution of the null A/B result on long-tail queries to the statistic that '67% of popular searches exhibit zero-click rates above 50%' does not directly support the synthesis, because the cited zero-click pattern applies to popular searches while the A/B test explicitly uses long-tail queries; long-tail queries may exhibit different intent distributions and lower zero-click rates, leaving open alternative explanations such as insufficient effect size or metric choice.

Authors: We agree that the original phrasing created an imprecise link between the zero-click statistic (derived from popular searches) and the A/B test results (on long-tail queries). This leaves open the possibility of other factors, including effect size or metric sensitivity. In the revised abstract and discussion, we have removed the direct attribution, instead presenting the 67% figure as contextual background on overall user behavior at IKEA while explicitly noting that long-tail queries may differ and that further query-specific analysis would be valuable. No new data on long-tail zero-click rates is added, as it was not collected in this study.

Revision: yes

Referee: [Abstract and methodology description] The central +2.6% category accuracy claim depends on the LLM-as-a-judge producing reliable relevance scores and on taxonomy-derived negatives being meaningfully harder than random sampling, yet the manuscript provides insufficient detail on the exact prompting strategy, scoring rubric, and negative construction procedure, preventing verification that these choices are load-bearing for the reported improvement.

Authors: We concur that the original manuscript lacked sufficient detail on these components, which are central to reproducing and validating the +2.6% gain. The revised manuscript now includes an expanded methodology section with: (1) the complete LLM prompt templates and system instructions used for relevance scoring, (2) the full scoring rubric (a 0-5 scale with explicit criteria for each level based on query-product semantic match, attribute alignment, and category relevance), and (3) the precise algorithm for constructing taxonomy-based negatives (including selection of sibling categories in the hierarchy, attribute mismatch sampling, and exclusion of random negatives for comparison). These additions make the load-bearing role of the structured approach verifiable.

Revision: yes
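To show how per-candidate scores on a 0-5 scale, as described in this simulated response, might be turned into contrastive training data, here is a small sketch. The thresholds (`pos_min=4`, `neg_max=1`), the function name, and the scores are all hypothetical; neither the paper summary nor the rebuttal specifies cut-offs.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (query, positive_id, negative_id)


def build_triples(scores: Dict[str, Dict[str, int]],
                  pos_min: int = 4, neg_max: int = 1) -> List[Triple]:
    """Pair every high-scored product with every low-scored one per query.

    Products with scores strictly between neg_max and pos_min are treated
    as ambiguous and are left out of training.
    """
    triples: List[Triple] = []
    for query, product_scores in scores.items():
        positives = [p for p, s in product_scores.items() if s >= pos_min]
        negatives = [p for p, s in product_scores.items() if s <= neg_max]
        for pos in positives:
            for neg in negatives:
                triples.append((query, pos, neg))
    return triples


# Invented LLM scores for one query over four candidate products.
llm_scores = {"grey 2-seat sofa": {"p1": 5, "p2": 1, "p3": 3, "p4": 0}}
triples = build_triples(llm_scores)
```

Dropping the ambiguous middle band is one design option; it trades training volume for cleaner labels, which matters when the judge's reliability has not been calibrated against human ratings.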

Circularity Check

0 steps flagged

No significant circularity in empirical results

The paper reports empirical outcomes from offline real-user query experiments (+2.6% category accuracy) and an A/B test on long-tail queries (no significant engagement lift, p>0.05), attributing the discrepancy to observed zero-click statistics on popular searches. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the presented chain. The methodology relies on structured negative sampling, LLM-based scoring, and live testing without any step that reduces by construction to its own inputs. This is a standard empirical retrieval paper whose central claims are externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contributions rest on the effectiveness of taxonomy and attribute-based negative selection and the accuracy of LLM scoring, both of which are domain assumptions rather than derived from first principles or external benchmarks.

axioms (1)

domain assumption: The LLM-as-a-judge system produces reliable relevance scores that can replace or augment human annotations for training data generation.
The methodology relies on this to create comprehensive training data without mentioning calibration against human judgments.

