DUET -- Dual User Embedding Transformers for Offsite Conversion Prediction
Pith reviewed 2026-06-27 16:54 UTC · model grok-4.3
The pith
DUET pre-trains two separate transformer encoders on clicks and conversions to improve offsite conversion predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DUET partitions user behavioral data into clicks and conversions and pre-trains dedicated transformer encoders with architectures matched to each stream's statistics: multi-layer self-attention for the dense click stream and interleaved cross- and self-attention for the sparse conversion stream. The complementary embeddings are jointly consumed by a downstream ranker without exceeding serving-latency budgets.
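To make the two attention patterns concrete, here is a minimal NumPy sketch, not the paper's implementation. The width `D`, the layer counts, single-head attention, mean pooling, and the `context` sequence that the conversion encoder cross-attends to are all assumptions made for illustration. The text here does not say what the cross-attention attends to.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding width (hypothetical)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_attn_params(d):
    # Query/key/value projections for one single-head attention layer.
    return {name: rng.normal(scale=d ** -0.5, size=(d, d)) for name in ("q", "k", "v")}


def attention(queries, context, p):
    # Scaled dot-product attention with a residual connection.
    # Self-attention is the special case where context is the query sequence.
    q, k, v = queries @ p["q"], context @ p["k"], context @ p["v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return queries + weights @ v


def click_encoder(clicks, layers):
    # Dense stream: a plain stack of self-attention layers, then mean pooling.
    h = clicks
    for p in layers:
        h = attention(h, h, p)
    return h.mean(axis=0)


def conversion_encoder(conversions, context, layers):
    # Sparse stream: each block cross-attends to `context`, then self-attends.
    h = conversions
    for cross_p, self_p in layers:
        h = attention(h, context, cross_p)
        h = attention(h, h, self_p)
    return h.mean(axis=0)


clicks = rng.normal(size=(200, D))     # many click events (synthetic)
conversions = rng.normal(size=(5, D))  # few conversion events (synthetic)
context = rng.normal(size=(8, D))      # assumed side context for cross-attention

click_layers = [make_attn_params(D) for _ in range(3)]
conv_layers = [(make_attn_params(D), make_attn_params(D)) for _ in range(2)]

click_emb = click_encoder(clicks, click_layers)
conv_emb = conversion_encoder(conversions, context, conv_layers)
ranker_input = np.concatenate([click_emb, conv_emb])  # both embeddings go to the ranker
print(ranker_input.shape)
```

Concatenation is only one way the ranker could consume the two embeddings. The sketch uses it because it keeps the two streams separate up to the ranker input.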
What carries the argument
DUET's pair of domain-specific transformer encoders that produce separate click and conversion user embeddings for joint use in ranking.
If this is right
- The two embeddings combine in the ranker to produce higher-accuracy OCVR predictions.
- Click and conversion signals receive attention patterns matched to their density and delay characteristics.
- Serving latency stays within the existing budget while accuracy improves.
- The approach yields measurable gains on both offline NE and online A/B metrics.
Where Pith is reading between the lines
- The same partitioning logic could be tested on other ranking tasks that mix high-volume short-term events with low-volume delayed outcomes.
- One could measure whether the benefit scales with the degree of statistical mismatch between the two streams.
- The method leaves open whether the downstream ranker itself could be further specialized once the embeddings are already separated.
Load-bearing premise
That the statistical differences between click and conversion signals are best handled by two separate tailored encoders rather than a single shared encoder.
What would settle it
A controlled experiment in which a single unified encoder matches or exceeds the reported NE reduction (up to 0.38%) and the A/B accuracy gains, on identical offsite conversion data and under the same latency budgets.
Original abstract
Offsite conversion rate (OCVR) prediction is an important ranking problem in computational recommendation systems. This task presents a modeling challenge: click signals are abundant and exhibit short temporal horizons, whereas conversion signals are inherently sparse, long-delayed, and frequently unattributed. Despite these statistical disparities, both signal types must inform models that operate within strict serving-latency constraints. Prior pre-training approaches address this heterogeneity with a single, undifferentiated encoder applied uniformly across both data streams. We propose DUET (Dual User Embedding Transformers), a framework that explicitly partitions user behavioral data into two domain-coherent streams -- clicks and conversions -- and pre-trains dedicated transformer encoders with architectures tailored to each stream's statistical characteristics: multi-layer self-attention for the dense click stream and interleaved cross- and self-attention for the sparse conversion stream. The resulting complementary embeddings are jointly consumed by a downstream ranker without exceeding serving-latency budgets. Evaluation demonstrates up to 0.38% normalized entropy (NE) reduction relative to the strongest baseline, and A/B test shows consistent improvements in OCVR prediction accuracy.
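For readers unfamiliar with the metric, the sketch below computes normalized entropy under its common definition: average log loss divided by the entropy of the empirical positive rate. The abstract does not spell out the paper's exact formula, so this definition, the reading of 0.38% as a relative reduction, and the helper names are assumptions.

```python
import numpy as np


def normalized_entropy(labels, preds, eps=1e-12):
    # Average log loss, normalized by the entropy of the empirical positive rate.
    labels = np.asarray(labels, dtype=float)
    preds = np.clip(np.asarray(preds, dtype=float), eps, 1 - eps)
    log_loss = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
    base_rate = np.clip(labels.mean(), eps, 1 - eps)
    base_entropy = -(base_rate * np.log(base_rate)
                     + (1 - base_rate) * np.log(1 - base_rate))
    return log_loss / base_entropy


def relative_ne_reduction(ne_baseline, ne_model):
    # Percentage by which the model's NE is lower than the baseline's.
    return 100.0 * (ne_baseline - ne_model) / ne_baseline


# A model that always predicts the base rate has NE equal to 1 by construction.
labels = [1, 0, 0, 0]
constant_preds = [0.25] * 4
print(normalized_entropy(labels, constant_preds))

# Hypothetical NE values chosen only to illustrate a 0.38% relative reduction.
print(relative_ne_reduction(1.0, 0.9962))
```

Lower NE is better, and a value below 1 means the model beats a predictor that always outputs the base rate.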
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DUET, a dual-encoder transformer framework for offsite conversion rate (OCVR) prediction that partitions user data into click and conversion streams and pre-trains separate encoders (multi-layer self-attention for dense clicks; interleaved cross-/self-attention for sparse conversions). The resulting embeddings are fed to a downstream ranker. The central empirical claim is an improvement of up to 0.38% normalized entropy (NE) over the strongest baseline together with positive A/B-test results on OCVR accuracy, all while respecting serving-latency constraints.
Significance. If the reported NE gain is shown to be attributable to the dual tailored-encoder design rather than capacity or hyper-parameter differences, the work would supply a concrete architectural response to the statistical mismatch between abundant short-horizon click signals and sparse long-delayed conversion signals, a recurring issue in large-scale recommendation systems.
Major comments (2)
- [Abstract] The claim of up to 0.38% NE reduction is presented without any description of the baselines, total parameter counts, dataset sizes, statistical significance tests, or ablation studies that compare the dual-encoder architecture against a single shared encoder of matched capacity and with appropriate sparsity handling (masking, variable-length sequences). This information is load-bearing for the central claim that the dual design itself is responsible for the gain.
- [Abstract] The manuscript asserts that the two domain-coherent streams are best addressed by architecture-tailored encoders whose outputs combine effectively in the downstream ranker, yet it supplies no capacity-controlled comparison or latency measurement that would rule out the possibility that a single encoder with a unified pre-training objective and equivalent total parameters would achieve the same result.
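A capacity-controlled comparison starts from a parameter budget. The sketch below counts parameters for a dual design and sizes a single self-attention encoder to roughly the same total. It assumes a standard transformer layer (four attention projections with biases, a two-layer feed-forward block, two LayerNorms) and ignores embedding tables. Every width and depth is hypothetical, since the abstract reports none.

```python
def self_attn_layer_params(d_model, d_ff):
    # Four projections (Q, K, V, output), each d_model x d_model plus a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Two-layer feed-forward block with biases.
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    return attn + ffn + norms


def cross_attn_sublayer_params(d_model):
    # One extra attention sublayer plus its LayerNorm.
    return 4 * (d_model * d_model + d_model) + 2 * d_model


def dual_encoder_params(d_model, d_ff, click_layers, conv_blocks):
    click = click_layers * self_attn_layer_params(d_model, d_ff)
    conv_block = cross_attn_sublayer_params(d_model) + self_attn_layer_params(d_model, d_ff)
    return click + conv_blocks * conv_block


def matched_single_encoder_layers(target_params, d_model, d_ff):
    # Number of plain self-attention layers whose total is closest to the target.
    return max(1, round(target_params / self_attn_layer_params(d_model, d_ff)))


# Hypothetical sizes, for illustration only.
d_model, d_ff = 64, 256
dual_total = dual_encoder_params(d_model, d_ff, click_layers=4, conv_blocks=2)
single_layers = matched_single_encoder_layers(dual_total, d_model, d_ff)
single_total = single_layers * self_attn_layer_params(d_model, d_ff)
gap_pct = 100.0 * (single_total - dual_total) / dual_total
print(dual_total, single_layers, single_total, round(gap_pct, 2))
```

Matching by depth alone leaves a residual gap, so a fair ablation would also report that gap or adjust `d_ff` to close it.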
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on the abstract and the need to substantiate the central claim. We will revise the abstract to incorporate the requested details on baselines, datasets, and significance, and we will strengthen the experimental section with additional capacity-controlled ablations where feasible.
Point-by-point responses
- Referee: [Abstract] The claim of up to 0.38% NE reduction is presented without any description of the baselines, total parameter counts, dataset sizes, statistical significance tests, or ablation studies that compare the dual-encoder architecture against a single shared encoder of matched capacity and with appropriate sparsity handling (masking, variable-length sequences). This information is load-bearing for the central claim that the dual design itself is responsible for the gain.
  Authors: We agree that the abstract, due to its brevity, omits several load-bearing details. The full manuscript describes the baselines (including single-encoder variants) in Section 3, reports dataset sizes and characteristics in Section 4.1, lists total parameter counts in Table 2, and evaluates statistical significance via bootstrap resampling in Section 4.2. Ablations addressing sparsity handling (masking and variable-length sequences) appear in Section 4.3. We will expand the abstract with a concise sentence summarizing these elements and the key ablation findings.
  Revision: yes
- Referee: [Abstract] The manuscript asserts that the two domain-coherent streams are best addressed by architecture-tailored encoders whose outputs combine effectively in the downstream ranker, yet it supplies no capacity-controlled comparison or latency measurement that would rule out the possibility that a single encoder with a unified pre-training objective and equivalent total parameters would achieve the same result.
  Authors: The manuscript already contains capacity-controlled comparisons in Section 4.3, where a single shared encoder with matched total parameter count and a unified pre-training objective is evaluated against the dual design; the dual architecture shows consistent gains. Latency measurements confirming that the dual encoders remain within serving budgets are reported in Section 5 and Table 5. We will revise the abstract to explicitly reference these controlled comparisons and latency results.
  Revision: yes
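The simulated rebuttal mentions bootstrap resampling without giving a procedure. One plausible form is a paired bootstrap over evaluation examples, sketched below on synthetic data. The data generator, the two sets of predictions, the NE definition (log loss over base-rate entropy), and the 95% percentile interval are all assumptions for illustration, not the paper's protocol.

```python
import numpy as np


def normalized_entropy(labels, preds, eps=1e-12):
    # Average log loss divided by the entropy of the empirical positive rate.
    preds = np.clip(preds, eps, 1 - eps)
    log_loss = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
    rate = np.clip(labels.mean(), eps, 1 - eps)
    return log_loss / -(rate * np.log(rate) + (1 - rate) * np.log(1 - rate))


def paired_bootstrap_ne_gain(labels, preds_baseline, preds_model, n_boot=500, seed=0):
    # Resample examples with replacement and score both models on the same draw,
    # so the interval reflects the paired difference rather than two separate NEs.
    rng = np.random.default_rng(seed)
    n = len(labels)
    gains = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        ne_base = normalized_entropy(labels[idx], preds_baseline[idx])
        ne_model = normalized_entropy(labels[idx], preds_model[idx])
        gains[b] = 100.0 * (ne_base - ne_model) / ne_base
    low, high = np.percentile(gains, [2.5, 97.5])
    return gains.mean(), low, high


# Synthetic evaluation set: sparse positives with a per-example true rate.
rng = np.random.default_rng(1)
n = 20000
true_rate = rng.beta(1, 20, size=n)
labels = (rng.random(n) < true_rate).astype(float)
preds_baseline = np.full(n, labels.mean())  # baseline: always predict the base rate
preds_model = true_rate                     # stand-in for a better-calibrated model

mean_gain, low, high = paired_bootstrap_ne_gain(labels, preds_baseline, preds_model)
print(f"relative NE reduction: {mean_gain:.2f}% (95% CI {low:.2f}% to {high:.2f}%)")
```

An interval that excludes zero would support a claim that the gain is not resampling noise. It would not, on its own, show that the dual design rather than extra capacity is the cause.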
Circularity Check
No circularity in derivation; proposal is empirical architecture choice.
Full rationale
The manuscript describes a modeling framework that partitions click and conversion streams into separate encoders with tailored attention patterns, then feeds the embeddings to a downstream ranker. No equations, parameter-fitting steps, or predictions are defined in the provided text. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented. The claimed 0.38% NE reduction is an empirical outcome, not a quantity derived by construction from the inputs. The derivation chain is therefore self-contained and does not reduce to any of the enumerated circular patterns.