Res-embedding for Deep Learning Based Click-Through Rate Prediction Modeling

Guorui Zhou; Kailun Wu; Kun Gai; Weijie Bian; Xiaoqiang Zhu; Zhao Yang

arxiv: 1906.10304 · v1 · pith:ZYUQDC6Pnew · submitted 2019-06-25 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-25 16:43 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords click-through rate prediction · CTR model · embedding · deep learning · generalization · interest domain · res-embedding

The pith

Small aggregation radius among embeddings of items in the same interest domain leads to better generalization in deep CTR prediction models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models user behavior via an interest delay model to define interest domains for items. It proves that embeddings of items sharing a domain should have small aggregation radius to improve generalization of deep CTR models that follow the embedding-plus-MLP structure. The authors then introduce res-embedding, in which each item's vector is the sum of a central component drawn from an item-based interest graph and a comparatively small residual component. Experiments on several public datasets show that the new structure raises model performance.

Core claim

By modeling user behavior with an interest delay model, we prove that small aggregation radius of embedding vectors for items in the same interest domain results in good generalization performance of deep CTR models. We then design res-embedding where each item's vector is the sum of a central embedding from an item-based interest graph and a small residual embedding vector.
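In symbols, the decomposition can be written as follows. The notation is assumed here for illustration and is not taken from the paper: $e_i$ is the embedding of item $i$, $c_i$ its central component, $r_i$ its residual, and $\epsilon$ a bound on the residual norm. The second line is just the triangle inequality; it shows why a small residual keeps two items close whenever their central components are close.

```latex
e_i = c_i + r_i, \qquad \|r_i\| \le \epsilon

\|e_i - e_j\| \;\le\; \|c_i - c_j\| + \|r_i\| + \|r_j\| \;\le\; \|c_i - c_j\| + 2\epsilon
```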

What carries the argument

Res-embedding, in which each item's embedding vector equals a central embedding from an item-based interest graph plus a small residual embedding vector.
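A minimal NumPy sketch of the idea follows. It is an illustration under stated assumptions, not the authors' implementation: this page does not say how the central embedding is computed from the interest graph, so the sketch assumes a row-normalised neighbour average, and the names `res_embedding`, `central_basis`, and `residual_scale` are hypothetical.

```python
import numpy as np

def res_embedding(adjacency, central_basis, residual, residual_scale=0.1):
    """Return item embeddings as central + small residual (illustrative only).

    adjacency:      (n_items, n_items) non-negative interest-graph weights
    central_basis:  (n_items, dim) vectors that are mixed over graph neighbours
    residual:       (n_items, dim) per-item residual vectors
    residual_scale: hypothetical knob that keeps the residual small
    """
    # ASSUMPTION: add self-loops and row-normalise, so each central
    # embedding is a weighted average over the item and its neighbours.
    weights = adjacency + np.eye(adjacency.shape[0])
    weights = weights / weights.sum(axis=1, keepdims=True)
    central = weights @ central_basis
    return central + residual_scale * residual

rng = np.random.default_rng(0)
# Toy graph: items 0 and 1 are linked, items 2 and 3 are linked.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 0, 0],
                      [0, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
central_basis = rng.normal(size=(4, 8))
residual = rng.normal(size=(4, 8))
embeddings = res_embedding(adjacency, central_basis, residual)

# Linked items share the same central vector here, so they differ only
# by their scaled residuals.
gap = np.linalg.norm(embeddings[0] - embeddings[1])
print(gap)
```

In this toy graph, items 0 and 1 have identical neighbourhoods, so the distance between their embeddings is set entirely by the residual scale.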

If this is right

Deep CTR models generalize better when embedding vectors of items that share an interest domain stay close together.
Res-embedding improves performance by separating a graph-derived central vector from a small residual vector without altering the rest of the network.
The interest delay model supplies an explicit way to group items into domains before embedding.
Empirical gains appear on multiple public datasets when the res-embedding structure replaces ordinary embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of central and residual components could be tested in other embedding-heavy tasks such as next-item recommendation.
Scaling the interest graph construction to much larger item sets might expose computational trade-offs not visible on the public datasets.
If the interest delay model is replaced by an alternative grouping method, the same small-radius requirement could be re-checked for continued validity.

Load-bearing premise

User behavior can be modeled with an interest delay model that defines clear interest domains for items.

What would settle it

Run the same deep CTR models on the public datasets with standard embeddings that deliberately produce large aggregation radii within interest domains and measure whether generalization metrics remain as high as with res-embedding.
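To run such a comparison, one needs a number for how tightly each domain's embeddings cluster. The sketch below is one way to measure it, assuming the radius is the largest distance from a domain member to the domain centroid. The paper's exact definition is not reproduced on this page, so treat this metric and the function name `aggregation_radius` as hypothetical.

```python
import numpy as np

def aggregation_radius(embeddings, domain_ids):
    """Map each domain id to the largest member-to-centroid distance.

    ASSUMPTION: max distance to the centroid stands in for the paper's
    aggregation radius, whose exact definition is not given here.
    """
    radii = {}
    for domain in np.unique(domain_ids):
        members = embeddings[domain_ids == domain]
        centroid = members.mean(axis=0)
        distances = np.linalg.norm(members - centroid, axis=1)
        radii[int(domain)] = float(distances.max())
    return radii

# Toy data: two domains with two items each.
embeddings = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
domain_ids = np.array([0, 0, 1, 1])
radii = aggregation_radius(embeddings, domain_ids)
print(radii)
```

Comparing these per-domain radii between a standard embedding table and a res-embedding table, alongside held-out AUC, would give the contrast described above.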

Figures

Figures reproduced from arXiv: 1906.10304 by Guorui Zhou, Kailun Wu, Kun Gai, Weijie Bian, Xiaoqiang Zhu, Zhao Yang.

**Figure 2.** The upper part of this figure is the case of [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Sketch Map of the basic res-embedding prototype [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** The left part of this figure is the items graph [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 6.** The proportion of to whole training set varies from [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

**Figure 5.** Compare the original embedding and the res [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 7.** The blue dots denote embedding vectors of the top [PITH_FULL_IMAGE:figures/full_fig_p007_7.png]

**Figure 8.** The relationship between the scale of the residual [PITH_FULL_IMAGE:figures/full_fig_p008_8.png]

read the original abstract

Recently, click-through rate (CTR) prediction models have evolved from shallow methods to deep neural networks. Most deep CTR models follow an Embedding&MLP paradigm, that is, first mapping discrete id features, e.g. user visited items, into low dimensional vectors with an embedding module, then learn a multi-layer perception (MLP) to fit the target. In this way, embedding module performs as the representative learning and plays a key role in the model performance. However, in many real-world applications, deep CTR model often suffers from poor generalization performance, which is mostly due to the learning of embedding parameters. In this paper, we model user behavior using an interest delay model, study carefully the embedding mechanism, and obtain two important results: (i) We theoretically prove that small aggregation radius of embedding vectors of items which belongs to a same user interest domain will result in good generalization performance of deep CTR model. (ii) Following our theoretical analysis, we design a new embedding structure named res-embedding. In res-embedding module, embedding vector of each item is the sum of two components: (i) a central embedding vector calculated from an item-based interest graph (ii) a residual embedding vector with its scale to be relatively small. Empirical evaluation on several public datasets demonstrates the effectiveness of the proposed res-embedding structure, which brings significant improvement on the model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Res-embedding is a simple graph-plus-residual tweak for CTR embeddings whose theoretical backing depends on an unvalidated interest delay model.

The main thing to know is that the claimed theoretical result on aggregation radius and generalization only holds inside the interest delay model the authors introduce; outside that assumption the link to real Embedding+MLP behavior is not established. They build an item-based interest graph to produce a central embedding per item, add a small residual vector, and argue this keeps embeddings of items in the same domain close, which their proof says improves generalization.

On public datasets they report better AUC and logloss than standard baselines. That empirical part is straightforward to check and the public data choice helps. The res-embedding structure itself is easy to implement and could be worth testing in production CTR pipelines if the graph construction is cheap.

The soft spot is the delay model. It supplies the definition of interest domains and the radius metric, yet the abstract gives no evidence that it matches observed co-occurrence patterns, no comparison to simpler behavior models, and no sensitivity checks. If the model is off, the radius-generalization statement does not transfer. The paper also does not discuss how the interest graph is built from data or how the residual scale is chosen in practice.

This work is mainly for people already running large-scale CTR systems who want a modest embedding change with some theoretical story attached. A referee could usefully verify the derivation under the stated assumptions and re-run the experiments with the graph details made explicit. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper models user behavior via an interest delay model that partitions items into domains, proves that small aggregation radius of embeddings within the same domain yields good generalization for Embedding+MLP CTR models, and introduces a res-embedding structure whose vectors are the sum of a central embedding (derived from an item-based interest graph) and a small residual embedding. Experiments on public datasets are reported to show performance gains over baselines.

Significance. If the central theoretical claim holds under realistic conditions, the work supplies a concrete design principle for embedding modules that could improve generalization in production CTR systems. The res-embedding construction is simple to implement and the reported empirical gains on public data are consistent with the claimed benefit, though the magnitude and robustness remain to be confirmed.

major comments (2)

[§3–4] The theoretical result (abstract and §3–4) is derived entirely under the interest delay model that supplies both the definition of 'same user interest domain' and the aggregation-radius metric. No section provides empirical validation of the delay model against observed item co-occurrence statistics, alternative behavioral models, or sensitivity checks; if the model does not capture real CTR data structure, the radius–generalization link does not transfer to deployed models.
[§5] The res-embedding construction (abstract and §5) defines the central component from an 'item-based interest graph' whose construction details and hyper-parameters are not shown to be independent of the same delay-model assumptions used in the proof; this creates a potential circularity between the theoretical justification and the practical implementation.

minor comments (2)

Notation for aggregation radius and interest domains should be defined once with a single equation rather than re-introduced in multiple sections.
The experimental section should report the exact public datasets used, the precise baseline implementations, and statistical significance of the reported AUC/log-loss improvements.
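One common way to attach uncertainty to an AUC improvement is a paired bootstrap over test examples. The sketch below is a generic illustration of that technique, not a procedure taken from the paper; the function names, the resample count, and the synthetic scores are all hypothetical.

```python
import numpy as np

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def bootstrap_auc_gain(labels, baseline_scores, candidate_scores,
                       n_resamples=500, seed=0):
    """95% percentile interval for AUC(candidate) - AUC(baseline), resampling examples in pairs."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    gains = []
    while len(gains) < n_resamples:
        idx = rng.integers(0, n, size=n)
        y = labels[idx]
        if y.min() == y.max():
            continue  # AUC is undefined unless both classes are present
        gains.append(auc(y, candidate_scores[idx]) - auc(y, baseline_scores[idx]))
    low, high = np.percentile(gains, [2.5, 97.5])
    return float(low), float(high)

# Synthetic example: the candidate's scores track the labels more closely.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
baseline_scores = labels + rng.normal(scale=2.0, size=200)
candidate_scores = labels + rng.normal(scale=1.0, size=200)
low, high = bootstrap_auc_gain(labels, baseline_scores, candidate_scores)
print(low, high)
```

If the resulting interval excludes zero, the gain is unlikely to be a resampling artifact on that test set; it says nothing about variation across training seeds, which would need repeated training runs.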

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on the theoretical assumptions and implementation details. We address each major comment below and indicate planned revisions.

Referee: [§3–4] The theoretical result (abstract and §3–4) is derived entirely under the interest delay model that supplies both the definition of 'same user interest domain' and the aggregation-radius metric. No section provides empirical validation of the delay model against observed item co-occurrence statistics, alternative behavioral models, or sensitivity checks; if the model does not capture real CTR data structure, the radius–generalization link does not transfer to deployed models.

Authors: The interest delay model serves as a formal abstraction to define user interest domains and to prove the aggregation-radius property for Embedding+MLP models. The proof establishes a sufficient condition under this model; the res-embedding construction is then motivated by the resulting design principle and is evaluated directly on real public CTR datasets. We acknowledge that the manuscript does not contain empirical checks of the delay model itself. In revision we will add an explicit discussion of the modeling assumptions and their scope, but a full empirical validation of the delay model lies outside the current scope. revision: partial

Referee: [§5] The res-embedding construction (abstract and §5) defines the central component from an 'item-based interest graph' whose construction details and hyper-parameters are not shown to be independent of the same delay-model assumptions used in the proof; this creates a potential circularity between the theoretical justification and the practical implementation.

Authors: The item-based interest graph is built from observed item co-occurrence counts in user histories, using standard graph-construction techniques that do not invoke the interest delay model. The delay model appears only in the theoretical analysis of §3–4. In the revised manuscript we will expand §5 with the precise graph-construction procedure, edge-weighting rule, and hyper-parameter choices to make the independence explicit. revision: yes
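The simulated response above describes a graph built from item co-occurrence counts in user histories. A minimal sketch of one such construction follows. The windowing rule, the `min_count` threshold, and the function name `cooccurrence_graph` are assumptions made for illustration; the page does not give the paper's actual procedure or edge-weighting rule.

```python
from collections import Counter

def cooccurrence_graph(histories, window=2, min_count=1):
    """Count how often two distinct items appear within `window` steps of each other.

    ASSUMPTION: undirected edges weighted by raw counts, pruned below `min_count`.
    Returns a dict mapping a sorted (item_a, item_b) pair to its count.
    """
    counts = Counter()
    for history in histories:
        for i, item in enumerate(history):
            # Look ahead only, so each unordered pair is counted once per occurrence.
            for other in history[i + 1 : i + 1 + window]:
                if other != item:
                    counts[tuple(sorted((item, other)))] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

# Toy user histories (item ids are made up).
histories = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
graph = cooccurrence_graph(histories, window=1)
print(graph)
```

Because this construction uses only observed sequences, it does not depend on the interest delay model, which is the independence the response claims.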

standing simulated objections not resolved

Empirical validation of the interest delay model against observed item co-occurrence statistics, alternative behavioral models, or sensitivity checks

Circularity Check

0 steps flagged

Theoretical result derived under explicit modeling assumption; no reduction to inputs by construction

full rationale

The paper introduces an interest delay model as a modeling assumption that partitions items into domains, then states a theoretical proof relating aggregation radius (defined inside that model) to generalization. No equations, self-citations, or fitted parameters are shown that make the claimed result equivalent to its inputs by definition. The empirical results use public datasets and are independent of the theoretical step. This is a standard modeling assumption plus proof structure with no detected circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the interest delay model as a domain assumption and the new res-embedding as an invented structure without independent evidence provided in the abstract.

axioms (1)

domain assumption User behavior can be modeled using an interest delay model.
This is invoked to study the embedding mechanism and derive the theoretical result.

invented entities (1)

res-embedding structure no independent evidence
purpose: To improve generalization by separating central and residual embeddings.
Introduced in the paper as a new embedding module.

pith-pipeline@v0.9.0 · 5789 in / 1011 out tokens · 24617 ms · 2026-05-25T16:43:37.842094+00:00 · methodology
