On the Practice of Scaling Search Conversion Rate Prediction

James Pak, Jyun-Yu Jiang, Fan Zhang, Sen Wang, Taekmin Kim, Henry Tsai, Vijay Rajaram, Juexin Lin, Mohitdeep Singh, Alessandro Magnani, Johnny Chen, Qian Zhao, Rao Fu, Zhirong Liang, Jordan Gilliland, Winter Jiao

arXiv: 2605.29232 · v1 · pith:A3TAYHBJnew · submitted 2026-05-28 · 💻 cs.IR

Pith reviewed 2026-06-29 06:01 UTC · model grok-4.3

classification 💻 cs.IR

keywords search conversion rate prediction · model scaling · CVR · embedding scaling · training data volume · inference optimization · warm-start training · A/B testing


The pith

The effects of scaling model backbone, embedding size, and training data on search CVR prediction are largely independent and additive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how to scale CVR prediction models for high-traffic e-commerce search while respecting training cost and serving latency limits. Experiments on over a year of production logs show that gains from larger backbones, bigger embeddings, and more training data combine without strong interactions. A warm-start approach speeds up retraining, and inference tricks such as decoupled graph execution plus dynamic batching keep even high-capacity models under latency budgets. The resulting model uses 2.5 times more data and 8 times more inference compute yet shows almost no latency increase. Online A/B tests record a combined 2.6 percent improvement in search conversion rate.

Core claim

The central claim is that scaling backbone computation, embedding parameter size, and training data volume produces largely independent and additive quality gains for search CVR models. On a large production dataset the authors find that choosing the right backbone and scaling factors is crucial, that a streamlined warm-start strategy accelerates training iterations, and that inference optimizations such as decoupled graph execution and dynamic batching permit low-latency GPU serving. These steps together allow deployment of a model trained on 2.5 times more data with 8 times more inference compute while keeping latency nearly unchanged, delivering a combined 2.6 percent lift in search conversion rate.

What carries the argument

The independence and additivity of scaling effects across backbone, embeddings, and data volume, which supports modular and efficient scaling decisions.

If this is right

Scaling exploration becomes more efficient because the three factors can be varied separately rather than jointly.
A warm-start strategy accelerates training iterations and simplifies deployment of model updates.
Decoupled graph execution and dynamic batching enable low-latency GPU serving for high-capacity models.
Models trained on 2.5 times more data with 8 times more inference compute can be deployed with minimal latency impact.
Online A/B tests show a combined 2.6 percent gain in search conversion rate from the scaled models.
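
To make the dynamic batching idea concrete, here is a minimal simulation, not the authors' serving code. It assumes a simple policy in which a batch is dispatched either when it reaches a size cap or when a newly arriving request finds that the oldest queued request has waited longer than a time budget. The function name `dynamic_batches` and its parameters are hypothetical; a real server would use queues and timers rather than a list of timestamps.

```python
def dynamic_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group request arrival times (in ms) into batches.

    Assumed policy (illustrative only): flush the pending batch when it is
    full, or when a new request arrives more than max_wait_ms after the
    first request in the pending batch.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # The oldest pending request has exceeded its wait budget: flush first.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch reached the size cap: flush immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival pattern: a burst, a straggler, a pair, a lone request.
example = dynamic_batches([0, 1, 2, 3, 10, 11, 30], max_batch_size=3, max_wait_ms=5)
print(example)
```

The trade-off this exposes is the usual one: a larger `max_wait_ms` yields fuller batches and better accelerator utilization, at the cost of added queueing delay for the first request in each batch.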

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same additive scaling pattern may appear in other ranking or recommendation tasks that rely on similar embedding-plus-backbone architectures.
Production teams could reduce exhaustive grid searches by treating the three scaling axes as roughly orthogonal.
Testing whether the pattern persists after the next architecture refresh would be a direct next experiment.
Data scaling may become the highest-leverage lever once inference optimizations are in place.

Load-bearing premise

The independence and additivity observed on the tested production dataset and architectures will generalize to future model updates.

What would settle it

Running the same scaling sweeps on a new high-traffic dataset or architecture and measuring large non-additive interactions between the three factors would falsify the central claim.
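
One way such a check could be set up is sketched below. It assumes a small factorial sweep in which each of the three factors is either at baseline (0) or scaled (1), and compares every joint gain with the sum of the corresponding single-factor gains. The helper `additivity_gap` and all numbers are hypothetical and are not taken from the paper.

```python
from itertools import product


def additivity_gap(gain, n_factors=3):
    """Return observed joint gain minus the sum of single-factor gains.

    gain maps a tuple of 0/1 flags (backbone, embedding, data) to the
    measured quality gain over the unscaled baseline. A gap near zero for
    every joint configuration is consistent with additive effects.
    """
    # Gains measured with exactly one factor scaled.
    single = [
        gain[tuple(1 if j == i else 0 for j in range(n_factors))]
        for i in range(n_factors)
    ]
    gaps = {}
    for cfg in product((0, 1), repeat=n_factors):
        if sum(cfg) < 2:
            continue  # only joint configurations carry interaction information
        predicted = sum(s for s, on in zip(single, cfg) if on)
        gaps[cfg] = gain[cfg] - predicted
    return gaps


# Hypothetical mAP gains in percent, invented for illustration only.
gain = {
    (0, 0, 0): 0.00,
    (1, 0, 0): 0.50,
    (0, 1, 0): 0.30,
    (0, 0, 1): 0.40,
    (1, 1, 0): 0.78,
    (1, 0, 1): 0.91,
    (0, 1, 1): 0.69,
    (1, 1, 1): 1.17,
}
gaps = additivity_gap(gain)
for cfg, gap in sorted(gaps.items()):
    print(cfg, round(gap, 3))
```

Gaps that are large relative to run-to-run noise would indicate interaction between factors; whether a given gap counts as large depends on the variance of the metric, which this sketch does not model.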

Figures

Figures reproduced from arXiv: 2605.29232 by Alessandro Magnani, Fan Zhang, Henry Tsai, James Pak, Johnny Chen, Jordan Gilliland, Juexin Lin, Jyun-Yu Jiang, Mohitdeep Singh, Qian Zhao, Rao Fu, Sen Wang, Taekmin Kim, Vijay Rajaram, Winter Jiao, Zhirong Liang.

**Figure 2.** FLOPs vs mAP gain against DCNv2 for different scaling factors.

**Figure 3.** Impact of data scaling for a MaskNet model on mAP.

**Figure 4.** FLOPs vs mAP gain for models and scaling factors.

**Figure 5.** Performance gains over different durations of training data with (a) different model ...

**Figure 6.** P99 latency of CPU serving, as well as unoptimized and optimized GPU serving with ...

Original abstract

Scaling a Search Conversion Rate (CVR) prediction model, especially in high-traffic environments, presents a challenge: superior model quality needs to be balanced with strict constraints on training cost and serving latency. This paper details an effective approach for scaling modern search CVR prediction models. We begin with an empirical study to understand the scaling performance of search CVR models, analyzing how quality improves as we scale three key factors of model backbone computation, the size of embedding parameters, and the volume of training data. We use a large-scale production dataset, comprising over a year of customer interaction logs from a high-traffic e-commerce platform, to evaluate the scalability of several state-of-the-art architectures and their ensembles. Our key findings are: (1) selecting the right backbone and scaling factors is crucial; (2) the impact of scaling backbone, embedding, and data is largely independent and additive, which has implications for more efficient scaling exploration; (3) a streamlined warmstart strategy can accelerate training iterations while simplifying new updates; (4) inference optimization strategies such as decoupled graph execution and dynamic batching can enable low-latency GPU serving even for high-capacity models. Compared to a baseline of a pre-scaling production model, we ultimately deployed a model trained on 2.5x larger training data with 8x more inference compute while having minimal latency impact. Online A/B tests also demonstrate that our launches achieved a combined +2.6% gain in a key metric of search conversion rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Scaling backbone, embeddings, and data show additive effects in this production CVR setup, backed by A/B tests, but the work is mostly empirical reporting on one domain.


The punchline for this paper is that the effects of scaling model backbone compute, embedding size, and training data are largely independent and additive for their search CVR task. This let them train on 2.5x data with 8x inference compute, keep latency similar via optimizations, and see a combined 2.6% improvement in conversion rate from online A/B tests.

What the paper does well is provide concrete production evidence. They use a year-plus of real customer logs from a high-traffic platform and evaluate several state-of-the-art architectures plus ensembles. The deployment story, including the warm-start strategy to speed up training and the inference tricks like decoupled graph execution and dynamic batching, gives practical details that could help others facing the same quality versus cost tradeoffs. The fact that they actually shipped the model and measured the lift online is a strength.
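
The abstract does not say how the warm-start strategy works, so the following is only a generic sketch of the idea: initialize a new model from a previous checkpoint wherever parameter names and shapes match, and leave everything else at its fresh initialization. The `(shape, values)` representation, the function `warm_start`, and the parameter names are all hypothetical.

```python
def warm_start(new_params, old_checkpoint):
    """Copy weights from old_checkpoint into new_params where they fit.

    Both arguments map a parameter name to a (shape, values) pair, where
    shape is a tuple and values is a flat list. A parameter is reused only
    if the name exists in the checkpoint and the shapes are identical.
    Returns the list of reused parameter names.
    """
    reused = []
    for name, (shape, _) in list(new_params.items()):
        old = old_checkpoint.get(name)
        if old is not None and old[0] == shape:
            new_params[name] = (shape, list(old[1]))  # copy, do not alias
            reused.append(name)
    return reused


# Hypothetical checkpoint from the previous model version.
old = {
    "embedding/query": ((4,), [0.1, 0.2, 0.3, 0.4]),
    "backbone/layer0": ((2,), [1.0, 2.0]),
}
# Hypothetical new model: same embedding, wider layer0, one extra layer.
new = {
    "embedding/query": ((4,), [0.0] * 4),
    "backbone/layer0": ((3,), [0.0] * 3),
    "backbone/layer1": ((2,), [0.0] * 2),
}
reused = warm_start(new, old)
print(reused)
```

In this toy case only the embedding is carried over, because the backbone layer changed shape and the extra layer has no counterpart in the old checkpoint.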

The soft spots are mostly around generalization and detail. Everything is from one e-commerce domain, so the independence finding might not transfer to other settings or model types. The abstract mentions the key findings but does not describe how they controlled for variables or quantified the additivity, which leaves the main result a bit underspecified. Since this is empirical work using existing methods rather than a new algorithm or derivation, the novelty is limited to the specific observations in this context.

This paper is for people who build and scale production models for search conversion or similar ranking tasks in e-commerce. A reader in industry or applied ML would get value from the scaling guidance and the optimization techniques. It deserves a serious referee because the live A/B results and large-scale data make the findings worth checking in detail, even if the scope is narrow.

Referee Report

1 major / 1 minor

Summary. The paper conducts an empirical study on scaling search CVR prediction models by varying backbone computation, embedding size, and training data volume on a large production e-commerce dataset spanning over a year of logs. It evaluates several SOTA architectures and ensembles, reports that the scaling effects are largely independent and additive, describes a warm-start training strategy and inference optimizations (decoupled graph execution, dynamic batching), and validates a final deployed model (2.5x data, 8x inference compute) via online A/B tests showing a combined +2.6% gain in search conversion rate.

Significance. If the reported independence and additivity hold, the result would allow more efficient scaling exploration in production recommender systems by avoiding exhaustive joint searches over the three factors. The manuscript supplies direct production evidence via live A/B tests on the final deployed model, which is a strength.

major comments (1)

[Abstract] The claim that online A/B tests demonstrate a combined +2.6% gain supplies no details on experimental controls, statistical testing procedures, sample sizes, data exclusion rules, or quantification of independence, leaving the central scaling claims only partially supported.

minor comments (1)

The manuscript would benefit from an explicit table or figure summarizing the scaling curves (backbone, embedding, data) across the tested architectures to make the additivity claim easier to verify.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and recommendation of minor revision. We address the single major comment below.


Referee: [Abstract] The claim that online A/B tests demonstrate a combined +2.6% gain supplies no details on experimental controls, statistical testing procedures, sample sizes, data exclusion rules, or quantification of independence, leaving the central scaling claims only partially supported.

Authors: We agree that the abstract would benefit from additional context on the A/B test methodology to better support the claims. In the revised manuscript we will update the abstract to briefly note that the tests used standard randomized traffic splitting on live search traffic, assessed significance via appropriate statistical tests at p<0.05, operated on sample sizes of tens of millions of impressions per arm, applied conventional exclusion rules for bots and anomalous sessions, and that independence/additivity of the three scaling factors was quantified via the controlled offline experiments (one-factor-at-a-time ablations) reported in Section 4. Full experimental controls, procedures, and per-factor results remain in Sections 4–5. We believe these concise additions will address the concern while respecting abstract length limits.

revision: yes
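
The simulated rebuttal mentions "appropriate statistical tests" without naming one. As an illustration of what a significance check on a conversion-rate lift could look like, the sketch below runs a two-sided two-proportion z-test. The choice of test, the function name, and every count are assumptions made for this example, not details from the paper.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_* are conversion counts and n_* are impression counts per arm.
    Uses the pooled standard error under the null of equal rates.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 20 million impressions per arm, 3.0% baseline rate,
# and a treatment arm with a 2.6% relative lift.
n_control, n_treatment = 20_000_000, 20_000_000
conv_control, conv_treatment = 600_000, 615_600

z, p_value = two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment)
lift = (conv_treatment / n_treatment) / (conv_control / n_control) - 1
print(f"relative lift = {lift:.3%}, z = {z:.2f}, significant at 0.05: {p_value < 0.05}")
```

With arms this large, even small relative lifts clear the p<0.05 bar, which is one reason the referee's request for sample sizes and exclusion rules matters more than the threshold itself.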

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical measurements


The paper reports an empirical scaling study on production logs and live A/B tests for search CVR models. The central finding—that backbone, embedding, and data scaling effects are largely independent and additive—is obtained by measuring quality changes across multiple SOTA architectures and ensembles on a fixed large dataset. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are invoked to establish this; the result is a direct observation from the experiments described. The deployment and +2.6% A/B gain are likewise validated externally to any internal model. This is self-contained empirical work with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review identifies no free parameters, axioms, or invented entities; the work consists of empirical evaluation of scaling factors on production data.

pith-pipeline@v0.9.1-grok · 5847 in / 1240 out tokens · 39575 ms · 2026-06-29T06:01:48.541611+00:00 · methodology



[1] [1]

Que2search: fast and accurate query and document understanding for search at Facebook

Andrés Abeliuk, Behrooz Lari, Anthony Sun, Xinyu Liu, Shobana Jameel, and Wei Chang. Que2search: fast and accurate query and document understanding for search at Facebook. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2455–2465. ACM, 2020

2020

[2] [2]

An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance

Sebastian Bruch, Xuanhui Wang, Michael Bendersky, and Marc Najork. An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance. InProceedings of the 2019 ACM SIGIR international conference on theory of information retrieval, pages 75–78, 2019

2019

[3] [3]

Wide & deep learning for recommender systems

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. InProceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016

2016

[4] [4]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapoli...

2019

[5] [5]

CLD3: Google’s compact language detector 3, 2019

Daniel Gillick, David Ryan, Shankar Kumar, Andrew Kunchukuttan, and Colin Cherry. CLD3: Google’s compact language detector 3, 2019. 13

2019

[6] [6]

DeepFM: A Factorization-Machine based Neural Network for CTR Prediction

Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-machine based neural network for ctr prediction.arXiv preprint arXiv:1703.04247, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[7] [7]

Practical lessons from predicting clicks on ads at facebook

Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. Practical lessons from predicting clicks on ads at facebook. InProceedings of the eighth international workshop on data mining for online advertising, pages 1–9, 2014

2014

[8] [8]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[9] [9]

Hyformer: Revisiting the roles of sequence modeling and feature interaction in ctr prediction, 2026

Yunwen Huang, Shiyong Hong, Xijun Xiao, Jinqiu Jin, Xuanyuan Luo, Zhe Wang, Zheng Chai, Shikang Wu, Yuchao Zheng, and Jingjian Lin. Hyformer: Revisiting the roles of sequence modeling and feature interaction in ctr prediction, 2026

2026

[10] [10]

Wang-Cheng Kang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Ting Chen, Lichan Hong, and Ed H. Chi. Deep hash embeddings for efficient registry-free recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 798–808, New York, NY , USA, 2021. ACM

2021

[11] [11]

Learning to embed categorical features without embedding tables for recommendation

Wang-Cheng Kang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Ting Chen, Lichan Hong, and Ed H Chi. Learning to embed categorical features without embedding tables for recommendation. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 840–850, 2021

2021

[12] [12]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[13] [13]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[14] [14]

Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1930–1939. Association for Computing Machinery (ACM), 2018

1930

[15] [15]

Ad click prediction: a view from the trenches

H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view from the trenches. InProceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230, 2013

2013

[16] [17]

Deep Learning Recommendation Model for Personalization and Recommendation Systems

Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sun- daraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G Azzolini, et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906

[17] [18]

Wentao Ouyang, Rui Dong, Xiuwu Zhang, Chaofeng Guo, Jinmei Luo, Xiangzheng Liu, and Yanlong Du. Contrastive learning for conversion rate prediction. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1909–1913, 2023.

[18] [19]

Xiaofeng Pan, Ming Li, Jing Zhang, Keren Yu, Hong Wen, Luping Wang, Chengjun Mao, and Bo Cao. MetaCVR: Conversion rate prediction via meta learning in small-scale recommendation scenarios. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2110–2114, 2022.

[19] [20]

Zhen Qin, Le Yan, Honglei Zhuang, Yi Tay, Rama Kumar Pasumarthi, Xuanhui Wang, Michael Bendersky, and Marc Najork. Are neural rankers still outperformed by gradient boosted decision trees? In International Conference on Learning Representations, 2021.

[20] [21]

Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. Compositional embeddings using complementary partitions for memory-efficient recommendation systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 165–175, New York, NY, USA, 2020. ACM.

[21] [22]

Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1441–1450, 2019.

[22] [23]

Richard Sutton. The bitter lesson. Incomplete Ideas (blog), 13(1):38, 2019.

[23] [24]

Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. Progressive layered extraction (PLE): A novel multi-task learning (MTL) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems, pages 269–278, New York, NY, USA,

[24] [25]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[25] [26]

Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, pages 1–7, 2017.

[26] [27]

Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of The Web Conference 2021, pages 1785–1797, 2021.

[27] [28]

Zhiqiang Wang, Qingyun She, and Junlin Zhang. MaskNet: Introducing feature-wise multiplication to CTR ranking models by instance-guided mask. arXiv preprint arXiv:2102.07619, 2021.

[28] [29]

Songpei Xu, Shuo Wang, Daning Guo, Xinxin Guo, Qiang Xiao, Bi Huang, Guohao Wu, and Chen Luo. Climber: Toward efficient scaling laws for large recommendation models. arXiv preprint arXiv:2502.09888, 2025.

[29] [30]

Bencheng Yan, Yuejie Lei, Zhiyuan Zeng, Di Wang, Kaiyi Lin, Pengjie Wang, Jian Xu, and Bo Zheng. From scaling to structured expressivity: Rethinking transformers for CTR prediction. arXiv preprint arXiv:2511.12081, 2025.

[30] [31]

Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, et al. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations. arXiv preprint arXiv:2402.17152, 2024.

[31] [32]

Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, and Wenlin Chen. Wukong: Towards a scaling law for large-scale recommendation, 2024.

[32] [33]

Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael Tsang, Wenjun Wang, et al. DHEN: A deep and hierarchical ensemble network for large-scale click-through rate prediction. arXiv preprint arXiv:2203.11014, 2022.

[33] [34]

Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5941–5948, 2019.

[34] [35]

Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, et al. RankMixer: Scaling up ranking models in industrial recommenders. arXiv preprint arXiv:2507.15551, 2025.

[35] [36]

Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2759–2769, 2021.

[36] [37]

Jinfeng Zhuang, Yinrui Li, Runze Su, Ke Xu, Zhixuan Shao, Kungang Li, Ling Leng, Han Sun, Meng Qi, Yixiong Meng, et al. On the practice of deep hierarchical ensemble network for ad conversion rate prediction. In Companion Proceedings of the ACM on Web Conference 2025, pages 671–680, 2025.