RAD-DPO: Robust Adaptive Denoising Direct Preference Optimization for Generative Retrieval in E-commerce
Pith reviewed 2026-05-15 19:02 UTC · model grok-4.3
The pith
RAD-DPO refines direct preference optimization for structured semantic IDs by detaching prefix gradients, weighting rewards by similarity, and adding global contrastive coverage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAD-DPO addresses three limitations of direct preference optimization on structured Semantic IDs: token-level gradient detachment prevents penalization of shared hierarchical prefixes, similarity-based dynamic reward weighting mitigates noisy pseudo-negatives from implicit feedback, and a multi-label global contrastive objective integrated with global SFT loss expands positive coverage to reduce the probability squeezing effect among valid candidates.
What carries the argument
RAD-DPO integrates three components: token-level gradient detachment to protect prefix structures, similarity-based dynamic reward weighting to reduce label noise, and a multi-label global contrastive objective combined with global SFT loss to expand positive coverage.
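The prefix-protection idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes SIDs are lists of integer codes, that the DPO margin is built from per-token log-probabilities, and that "detachment" means blocking the gradient that would otherwise flow into the rejected sequence's tokens where they coincide with the chosen sequence's prefix. The function names and the default `beta` are hypothetical.

```python
import math

def shared_prefix_len(chosen, rejected):
    """Number of leading SID tokens the two sequences have in common."""
    n = 0
    for a, b in zip(chosen, rejected):
        if a != b:
            break
        n += 1
    return n

def rejected_token_grads(chosen, rejected, margin, beta=0.1, detach_prefix=True):
    """Gradient of -log(sigmoid(margin)) w.r.t. each rejected-token log-prob.

    `margin` is beta * (chosen log-ratio - rejected log-ratio), computed elsewhere.
    Without detachment every rejected token receives the same positive gradient,
    so gradient descent lowers its log-prob. With detachment (assumed reading of
    the paper), tokens inside the shared prefix receive zero gradient.
    """
    push_down = beta * (1.0 - 1.0 / (1.0 + math.exp(-margin)))  # beta * sigmoid(-margin)
    k = shared_prefix_len(chosen, rejected) if detach_prefix else 0
    return [0.0 if t < k else push_down for t in range(len(rejected))]
```

In an autodiff framework the same effect would typically come from a stop-gradient on the prefix positions rather than from a hand-written gradient. The explicit version here only shows which tokens are spared.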
If this is right
- Generative retrieval models can be aligned more reliably with real user preferences in hierarchical ID spaces.
- Training time and compute can be reduced while raising precision in large-scale e-commerce search.
- Multi-label queries no longer force probability mass to concentrate on only a subset of relevant items.
- Implicit feedback data becomes usable without manual cleaning because noise is dynamically down-weighted.
- The same alignment approach can be applied to other autoregressive structured decoding tasks beyond product search.
Where Pith is reading between the lines
- The method may transfer to other domains that use hierarchical or tree-structured outputs, such as code generation or knowledge-base completion.
- Longer-term user retention and click-through patterns after deployment would be a useful next measurement beyond the reported A/B metrics.
- Combining RAD-DPO with techniques that explicitly model user context or session history could further reduce the remaining error rate on tail queries.
Load-bearing premise
The three added components resolve DPO's specific failures on structured SIDs without creating new instabilities or biases during training.
What would settle it
A controlled ablation on the same JD.com dataset would settle it. If removing any single component (gradient detachment, dynamic weighting, or the global contrastive term) leaves retrieval precision and training stability unchanged or improved, the central claim that each component is needed would be falsified.
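A leave-one-out ablation of this kind is easy to enumerate. The sketch below only builds the configuration grid. The component names are labels chosen here for illustration, not flags from the paper's code, and the training and evaluation steps are left out.

```python
from itertools import product

# Hypothetical switch names for the three RAD-DPO components.
COMPONENTS = ("gradient_detachment", "dynamic_weighting", "global_contrastive")

def ablation_grid(components=COMPONENTS):
    """Every on/off combination of the components, as config dicts."""
    return [
        dict(zip(components, flags))
        for flags in product((True, False), repeat=len(components))
    ]

def leave_one_out(components=COMPONENTS):
    """Configs with exactly one component disabled, keyed by the removed one."""
    return {off: {c: c != off for c in components} for off in components}
```

The full grid has eight runs. The three leave-one-out configs are the minimum needed to test whether each component carries weight on its own.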
Original abstract
Generative Retrieval (GR) is rapidly transforming e-commerce search by replacing traditional multi-stage pipelines with the autoregressive decoding of structured Semantic IDs (SIDs). Despite this architectural efficiency, aligning GR models with nuanced, real-world user preferences remains a critical challenge. While Direct Preference Optimization (DPO) offers an efficient alignment solution, its direct application to structured SIDs suffers from three limitations: (i) it penalizes shared hierarchical prefixes, causing gradient conflicts; (ii) it is vulnerable to noisy pseudo-negatives from implicit feedback; and (iii) in multi-label queries with multiple relevant items, it exacerbates a probability "squeezing effect" among valid candidates. To address these issues, we propose RAD-DPO, which introduces token-level gradient detachment to protect prefix structures, similarity-based dynamic reward weighting to mitigate label noise, and a multi-label global contrastive objective integrated with global SFT loss to explicitly expand positive coverage. Extensive offline evaluations and large-scale online A/B testing on JD.com's core search engine demonstrate that RAD-DPO achieves significant improvements in both retrieval precision and training efficiency, proving its robustness for massive industrial deployments.
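For reference, the standard DPO objective from the original DPO paper, which the abstract takes as its starting point, can be written as follows. Mapping x to the query, y_w to the preferred SID, and y_l to the dispreferred SID is this review's reading, not notation taken from the paper.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right],
\qquad
\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x,\, y_{<t}).
```

The second identity is the usual autoregressive factorization. It is what makes token-level interventions on an SID possible, since the sequence log-probability is a sum of per-token terms.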
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RAD-DPO as an extension of Direct Preference Optimization tailored to generative retrieval models that decode structured Semantic IDs (SIDs) for e-commerce search. It identifies three limitations of vanilla DPO on hierarchical SIDs—gradient conflicts on shared prefixes, sensitivity to noisy pseudo-negatives, and probability squeezing in multi-label settings—and introduces three fixes: token-level gradient detachment, similarity-based dynamic reward weighting, and a multi-label global contrastive objective combined with global SFT loss. The central claim is that these changes yield significant gains in retrieval precision and training efficiency, supported by offline evaluations and large-scale online A/B tests on JD.com.
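The similarity-based weighting can be sketched as below. The paper's exact formula is not reproduced on this page, so this is a guess at the general shape. It assumes item embeddings are available, uses cosine similarity, and scales the pairwise loss (rather than the reward itself) by a weight that shrinks as the pseudo-negative approaches the positive. All names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def noise_weight(pos_emb, neg_emb):
    """Hypothetical weight in [0, 1]: near-duplicates of the positive get ~0."""
    return 1.0 - max(0.0, cosine(pos_emb, neg_emb))

def weighted_pair_loss(margin, pos_emb, neg_emb):
    """Pairwise logistic loss -log(sigmoid(margin)), scaled by the noise weight."""
    # log1p(exp(-m)) equals -log(sigmoid(m)).
    return noise_weight(pos_emb, neg_emb) * math.log1p(math.exp(-margin))
```

Under this assumption, an unclicked item that is nearly identical to the clicked one is treated as a likely false negative and contributes almost nothing to the update.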
Significance. If the experimental claims hold under scrutiny, the work would be significant for industrial generative retrieval: it offers a practical, targeted adaptation of preference optimization to hierarchical structured outputs, which are increasingly used in production search systems. The inclusion of both offline metrics and real-world A/B testing on a major e-commerce platform provides direct evidence of deployability, potentially influencing how alignment techniques are scaled for autoregressive retrieval models.
major comments (1)
- [§4 and §5] §4 (Experiments) and §5 (Online A/B Testing): The manuscript asserts 'significant improvements' in precision and efficiency from offline evaluations and large-scale online A/B tests, yet provides no concrete information on the chosen baselines (e.g., standard DPO, other GR alignment methods), exact metrics (NDCG@K, Recall@K, etc.), effect sizes, statistical significance tests, or experimental controls such as traffic split, duration, or user cohort size. Without these details the central empirical claim cannot be properly assessed.
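For concreteness, the two metrics named in this comment can be computed as in the sketch below. It assumes binary relevance and the common log2(rank + 1) discount for NDCG. The paper may use graded relevance or a different variant.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    rel = set(relevant)
    if not rel:
        return 0.0
    return len(set(ranked[:k]) & rel) / len(rel)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k with a log2(rank + 1) discount."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(rel))))
    return dcg / ideal if ideal else 0.0
```

Reporting both matters for generative retrieval: Recall@K reflects whether valid SIDs are decoded at all, while NDCG@K reflects where they land in the beam.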
minor comments (2)
- [Abstract] Abstract: The acronym 'SIDs' is introduced without an initial expansion, even though the full term 'Semantic IDs' appears later; this should be corrected for immediate clarity.
- [§3] §3 (Method): The integration of the multi-label global contrastive objective with the global SFT loss is described at a high level; an explicit combined loss equation would improve reproducibility.
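One plausible shape for such a combined loss is sketched below. It is an assumption, not the paper's equation. It treats the multi-label contrastive term as the negative log of the softmax mass assigned jointly to all positives in a candidate set, and adds a weighted negative log-likelihood (SFT) term over the positives. The function names and `sft_weight` are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a non-empty list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def multilabel_contrastive(scores, positive_idx):
    """-log of the softmax mass placed jointly on all positive candidates."""
    pos = [scores[i] for i in positive_idx]
    return logsumexp(scores) - logsumexp(pos)

def combined_loss(scores, positive_idx, pos_logps, sft_weight=1.0):
    """Hypothetical total: contrastive term + sft_weight * mean NLL of positives."""
    sft = -sum(pos_logps) / len(pos_logps)
    return multilabel_contrastive(scores, positive_idx) + sft_weight * sft
```

Because the contrastive term depends only on the total mass over positives, shifting probability from one valid item to another does not change it. That is one way to avoid the squeezing effect the paper describes, under the assumptions stated above.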
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript describing RAD-DPO. We address the major comment below and will revise the paper to improve clarity and completeness of the experimental reporting.
- Referee: [§4 and §5] §4 (Experiments) and §5 (Online A/B Testing): The manuscript asserts 'significant improvements' in precision and efficiency from offline evaluations and large-scale online A/B tests, yet provides no concrete information on the chosen baselines (e.g., standard DPO, other GR alignment methods), exact metrics (NDCG@K, Recall@K, etc.), effect sizes, statistical significance tests, or experimental controls such as traffic split, duration, or user cohort size. Without these details the central empirical claim cannot be properly assessed.
  Authors: We acknowledge that the current presentation of results in §4 and §5 could be more explicit. In the revised manuscript we will expand these sections to: (i) enumerate all baselines with their exact configurations (standard DPO, other GR alignment methods, and non-alignment GR models); (ii) report the precise metrics (NDCG@10, Recall@50, etc.) together with absolute values, relative improvements, and effect sizes; (iii) include statistical significance results (paired t-tests or bootstrap p-values); and (iv) detail the online A/B test protocol, including traffic allocation (50/50 split), test duration (14 days), and cohort size (tens of millions of users). These additions will be placed in the main text and tables so that the empirical claims can be fully evaluated. The core technical contributions and the reported gains remain unchanged.
  Revision: yes
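The bootstrap test mentioned in the response could look like the sketch below. It assumes per-query metric values are available for both systems on the same queries, and it uses a simple approximation: the p-value is the fraction of resamples in which the summed paired difference is not positive. The function name, resample count, and seed are illustrative.

```python
import random

def paired_bootstrap_pvalue(treatment, control, n_resamples=2000, seed=0):
    """Approximate one-sided p-value for 'treatment mean > control mean'.

    Resamples the paired per-query differences with replacement and counts
    how often the resampled total difference fails to be positive.
    """
    diffs = [t - c for t, c in zip(treatment, control)]
    rng = random.Random(seed)
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

Pairing by query removes between-query variance, which is usually large in search metrics. For online A/B tests the unit of resampling would more naturally be the user or session than the query.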
Circularity Check
No significant circularity detected
The paper introduces RAD-DPO by defining three explicit new components (token-level gradient detachment, similarity-based dynamic reward weighting, multi-label global contrastive loss + global SFT) to fix stated DPO limitations on hierarchical SIDs. These are presented as engineering extensions, not derived from prior equations or self-citations. Validation rests on offline metrics and large-scale online A/B tests on JD.com rather than any reduction of outputs to fitted inputs or self-referential definitions. No load-bearing step collapses to a self-citation chain, ansatz smuggled via citation, or renaming of known results; the central claims remain empirically grounded and independent of the method's own fitted quantities.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Standard DPO applied directly to structured Semantic IDs causes gradient conflicts on shared hierarchical prefixes.
- Domain assumption: Implicit feedback in e-commerce contains noisy pseudo-negatives that degrade DPO training.