Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com
Pith reviewed 2026-05-09 19:14 UTC · model grok-4.3
The pith
Taxonomy-based negative sampling and LLM judging improve offline category accuracy for IKEA product search by 2.6 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mining structured negatives from the product taxonomy and using an LLM-as-a-judge to assign relevance scores improves contrastive training of late-interaction dense retrievers, raising offline category accuracy. The gain does not carry over to online engagement metrics, which the paper attributes to a large fraction of user searches ending with zero clicks regardless of result quality.
What carries the argument
Structured negative sampling that draws on product hierarchical taxonomy and attributes to produce semantically challenging negatives, together with an LLM-based system that evaluates relevance for training data generation.
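A minimal sketch of how taxonomy- and attribute-based negatives might be drawn, assuming a catalog where each product carries a root-to-leaf taxonomy path and a small attribute dictionary. The catalog, the function names, and the choice of "sibling leaf category" and "single attribute mismatch" as the two strategies are illustrative assumptions, not the paper's published procedure.

```python
import random

# Hypothetical toy catalog: product id -> taxonomy path (root to leaf) and attributes.
CATALOG = {
    "sofa-1": {"path": ("furniture", "sofas", "2-seat"), "attrs": {"color": "grey"}},
    "sofa-2": {"path": ("furniture", "sofas", "3-seat"), "attrs": {"color": "grey"}},
    "sofa-3": {"path": ("furniture", "sofas", "2-seat"), "attrs": {"color": "blue"}},
    "chair-1": {"path": ("furniture", "chairs", "armchairs"), "attrs": {"color": "grey"}},
    "lamp-1": {"path": ("lighting", "lamps", "floor"), "attrs": {"color": "black"}},
}


def sibling_negatives(positive_id, catalog):
    """Products in a sibling leaf category: same parent path, different leaf."""
    pos_path = catalog[positive_id]["path"]
    return sorted(
        pid
        for pid, item in catalog.items()
        if pid != positive_id
        and item["path"][:-1] == pos_path[:-1]
        and item["path"][-1] != pos_path[-1]
    )


def attribute_negatives(positive_id, catalog, attr):
    """Products in the same leaf category that differ on one attribute."""
    pos = catalog[positive_id]
    return sorted(
        pid
        for pid, item in catalog.items()
        if pid != positive_id
        and item["path"] == pos["path"]
        and item["attrs"].get(attr) != pos["attrs"].get(attr)
    )


def sample_negatives(positive_id, catalog, attr, k, seed=0):
    """Draw up to k structured negatives from both pools (seeded for repeatability)."""
    pool = sibling_negatives(positive_id, catalog) + attribute_negatives(
        positive_id, catalog, attr
    )
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```

In this toy catalog, a grey 2-seat sofa gets a 3-seat sofa and a blue 2-seat sofa as negatives, while the unrelated floor lamp, which random sampling could easily pick, is never drawn.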
If this is right
- Taxonomy-derived negatives yield more effective contrastive training than random sampling, improving offline retrieval quality.
- LLM-based relevance evaluation scales the creation of labeled data for dense retrieval models.
- Offline gains in category accuracy do not guarantee online engagement improvements when zero-click rates are high.
- Retrieval systems benefit from incorporating real user search behavior, such as intent distributions and zero-click patterns, into evaluation and training.
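The first implication can be made concrete with a small sketch of late-interaction scoring combined with a contrastive loss. It assumes a ColBERT-style MaxSim score and an InfoNCE-style softmax loss; the page only says "late-interaction" and "contrastive", so both specific choices, the function names, and the temperature parameter are assumptions for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: each query token takes its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def info_nce(pos_score, neg_scores, temperature=1.0):
    """Softmax cross-entropy with the positive as the target class.
    Uses the log-sum-exp trick for numerical stability."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]
```

A negative that scores close to the positive produces a larger loss, and therefore a larger gradient, than one that scores far below it. That is the usual argument for why hard negatives carry more training signal than random ones.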
Where Pith is reading between the lines
- Similar taxonomy-guided negative mining could extend to other e-commerce platforms with hierarchical product catalogs.
- Addressing zero-click queries may require techniques beyond ranking, such as query suggestion or result diversification.
- The findings point to a general need in information retrieval to align offline metrics more closely with downstream user actions.
Load-bearing premise
The LLM-as-a-judge generates reliable relevance scores for use as training labels, and negatives derived from the taxonomy are meaningfully harder and more effective than random sampling.
What would settle it
Running a controlled online experiment that directly compares models trained with taxonomy negatives versus random negatives while measuring engagement metrics, or having human annotators independently rate the same pairs to check LLM score accuracy.
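The human-annotation check could be summarized with a chance-corrected agreement statistic. The sketch below uses Cohen's kappa, which is one common choice and not something the paper is stated to report; the function name and the toy labels in the usage note are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both raters labeled independently at their own base rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohen_kappa(llm_labels, human_labels)` near 1 would support using the LLM scores as training labels, while a value near 0 would mean the agreement is no better than chance.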
Original abstract
Contrastive learning is a core component of modern retrieval systems, but its effectiveness heavily relies on the quality of negative examples used during training. In this work, we present a systematic approach to improving dense retrieval for IKEA product search through structured negative sampling strategies and scalable LLM-as-a-judge relevance evaluation. Building on IKEA Search Engine's late-interaction retrieval architectures, we introduce two key contributions: (1) structured negative sampling strategies that leverage product hierarchical taxonomy and product attributes to generate semantically challenging negatives, and (2) a comprehensive LLM-based evaluation methodology for generating training data. Rather than relying on sparse human annotations or random sampling, our LLM-based evaluation system allocates a score for all candidate products against each query. Our methodology achieves +2.6% average category accuracy on offline real user query experiments on the Canada market. However, our A/B test on long-tail queries showed no statistically significant differences in user engagement metrics between the improved and baseline models (p > 0.05). We trace this gap to user search behavior: 67% of popular searches exhibit zero-click rates above 50%, indicating that a substantial proportion of search sessions result in no product engagement regardless of result ranking. These findings underscore the importance of hard negative mining but also the need for grounding training data and offline evals in real user search behavior, including query intent distribution and zero-click patterns, to bridge the gap between offline retrieval quality and online user engagement.
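The zero-click statistic in the abstract is a share of queries whose own zero-click rate exceeds a threshold. A minimal sketch of that computation follows, assuming a log that maps each query to the click counts of its search sessions; the data layout and function names are assumptions, and the numbers in the usage note are made up.

```python
def zero_click_rate(session_clicks):
    """Fraction of a query's sessions that ended with no product click."""
    return sum(1 for clicks in session_clicks if clicks == 0) / len(session_clicks)


def share_above_threshold(query_sessions, threshold=0.5):
    """Share of queries whose zero-click rate is strictly above the threshold."""
    # Queries with no logged sessions are skipped to avoid dividing by zero.
    rates = [zero_click_rate(s) for s in query_sessions.values() if s]
    if not rates:
        raise ValueError("no queries with logged sessions")
    return sum(1 for r in rates if r > threshold) / len(rates)
```

With a toy log such as `{"sofa": [0, 0, 1, 0], "bookcase": [1, 2, 0, 1], "opening hours": [0, 0, 0]}`, two of the three queries have a zero-click rate above 50%. The same function run separately on popular and long-tail query sets would show whether the two populations actually behave alike.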
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes structured negative sampling using product taxonomy and attributes, combined with an LLM-as-a-judge system for relevance scoring, to improve contrastive learning for dense retrieval in IKEA's product search engine. It reports a +2.6% gain in average category accuracy in offline experiments with real user queries from the Canada market, but finds no statistically significant lift in user engagement metrics in an A/B test on long-tail queries (p > 0.05). The authors attribute this to high zero-click rates in real user behavior: 67% of popular searches have zero-click rates above 50%.
Significance. If the central claims hold after clarification, the work provides practical value for e-commerce retrieval by demonstrating domain-specific hard-negative mining and the necessity of aligning offline metrics with observed user patterns such as zero-click behavior. The inclusion of live A/B testing alongside offline category accuracy offers a grounded empirical assessment that could inform similar industrial systems, though the offline-online discrepancy highlights broader challenges in translating retrieval gains to engagement.
major comments (2)
- [Abstract] Abstract: The attribution of the null A/B result on long-tail queries to the statistic that '67% of popular searches exhibit zero-click rates above 50%' does not directly support the synthesis, because the cited zero-click pattern applies to popular searches while the A/B test explicitly uses long-tail queries; long-tail queries may exhibit different intent distributions and lower zero-click rates, leaving open alternative explanations such as insufficient effect size or metric choice.
- [Abstract] Abstract and methodology description: The central +2.6% category accuracy claim depends on the LLM-as-a-judge producing reliable relevance scores and on taxonomy-derived negatives being meaningfully harder than random sampling, yet the manuscript provides insufficient detail on the exact prompting strategy, scoring rubric, and negative construction procedure, preventing verification that these choices are load-bearing for the reported improvement.
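The "insufficient effect size" alternative raised in the first comment can be illustrated with a two-sided two-proportion z-test on a click-through style metric. The paper does not state which test or metric its A/B analysis used, so the test choice, the function name, and the counts in the usage note are assumptions for illustration only.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.
    Returns (z, p_value), using the pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

With hypothetical counts of 100 versus 110 engaged sessions out of 1,000 per arm, `two_proportion_z_test(100, 1000, 110, 1000)` gives a p-value well above 0.05, even though the treatment arm is nominally 10% better. A null result at that sample size therefore says little about whether a small real lift exists.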
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important issues of precision in the abstract and the need for greater methodological transparency. We have revised the manuscript to address both points directly.
- Referee: [Abstract] Abstract: The attribution of the null A/B result on long-tail queries to the statistic that '67% of popular searches exhibit zero-click rates above 50%' does not directly support the synthesis, because the cited zero-click pattern applies to popular searches while the A/B test explicitly uses long-tail queries; long-tail queries may exhibit different intent distributions and lower zero-click rates, leaving open alternative explanations such as insufficient effect size or metric choice.
  Authors: We agree that the original phrasing created an imprecise link between the zero-click statistic (derived from popular searches) and the A/B test results (on long-tail queries). This leaves open the possibility of other factors, including effect size or metric sensitivity. In the revised abstract and discussion, we have removed the direct attribution, instead presenting the 67% figure as contextual background on overall user behavior at IKEA while explicitly noting that long-tail queries may differ and that further query-specific analysis would be valuable. No new data on long-tail zero-click rates is added, as it was not collected in this study.
  Revision: yes
- Referee: [Abstract] Abstract and methodology description: The central +2.6% category accuracy claim depends on the LLM-as-a-judge producing reliable relevance scores and on taxonomy-derived negatives being meaningfully harder than random sampling, yet the manuscript provides insufficient detail on the exact prompting strategy, scoring rubric, and negative construction procedure, preventing verification that these choices are load-bearing for the reported improvement.
  Authors: We concur that the original manuscript lacked sufficient detail on these components, which are central to reproducing and validating the +2.6% gain. The revised manuscript now includes an expanded methodology section with: (1) the complete LLM prompt templates and system instructions used for relevance scoring, (2) the full scoring rubric (a 0-5 scale with explicit criteria for each level based on query-product semantic match, attribute alignment, and category relevance), and (3) the precise algorithm for constructing taxonomy-based negatives (including selection of sibling categories in the hierarchy, attribute mismatch sampling, and exclusion of random negatives for comparison). These additions make the load-bearing role of the structured approach verifiable.
  Revision: yes
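One way the 0-5 rubric scores mentioned in the rebuttal could feed contrastive training is by thresholding them into positives and negatives and pairing the two. The sketch below assumes cutoffs of 4 and above for positives and 1 and below for negatives; those cutoffs, the function name, and the triple format are hypothetical and are not given on this page.

```python
def build_training_triples(query, scored_products, pos_min=4, neg_max=1):
    """Turn per-product LLM relevance scores (0-5) into (query, positive, negative)
    triples. Mid-range scores are left out as too ambiguous to label either way."""
    for product_id, score in scored_products.items():
        if not 0 <= score <= 5:
            raise ValueError(f"score out of range for {product_id}: {score}")
    positives = [p for p, s in scored_products.items() if s >= pos_min]
    negatives = [p for p, s in scored_products.items() if s <= neg_max]
    # Every positive is paired with every negative for the same query.
    return [(query, pos, neg) for pos in positives for neg in negatives]
```

Dropping the mid-range scores is a design choice that trades training volume for label confidence; where the cutoffs sit would itself need validating against human judgments.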
Circularity Check
No significant circularity in empirical results
The paper reports empirical outcomes from offline real-user query experiments (+2.6% category accuracy) and an A/B test on long-tail queries (no significant engagement lift, p>0.05), attributing the discrepancy to observed zero-click statistics on popular searches. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the presented chain. The methodology relies on structured negative sampling, LLM-based scoring, and live testing without any step that reduces by construction to its own inputs. This is a standard empirical retrieval paper whose central claims are externally falsifiable via the reported experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The LLM-as-a-judge system produces reliable relevance scores that can replace or augment human annotations for training data generation.