Understanding DNNs in Feature Interaction Models: A Dimensional Collapse Perspective
Pith reviewed 2026-05-07 13:30 UTC · model grok-4.3
The pith
DNNs mitigate dimensional collapse of embeddings in feature interaction models rather than capturing high-order interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both parallel and stacked DNNs effectively mitigate the dimensional collapse of embeddings in feature interaction recommendation models. A gradient-based theoretical analysis, supported by empirical evidence from overall and ablation studies, uncovers the mechanism behind dimensional collapse and its mitigation by DNN structures.
What carries the argument
Dimensional collapse of embeddings, in which representations lose effective dimensionality, is mitigated by parallel or stacked DNN components through their effect on gradient dynamics.
If this is right
- Reconciles prior conflicting claims by showing DNN benefits arise from collapse mitigation instead of high-order interaction capture.
- Model improvements in these systems can target embedding stability directly without assuming interaction-order learning.
- Ablation studies should separate collapse effects from other DNN contributions to isolate performance drivers.
- Design of future feature interaction models can prioritize DNN placements that preserve embedding dimensionality.
Where Pith is reading between the lines
- Simpler techniques like targeted regularization on embedding norms might replicate DNN benefits at lower cost.
- The collapse-mitigation view could apply to embedding-based models outside recommendations, such as in language or graph tasks.
- Direct measurement of embedding rank or effective dimension before and after DNN addition would test the mechanism cleanly (see the sketch after this list).
- Performance attribution studies could quantify the fraction of gains due to collapse reduction versus added parameters.
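
As a concrete version of the measurement suggested above, here is a minimal sketch of a RankMe-style effective-rank metric (Garrido et al., 2023): the exponential of the entropy of the normalized singular-value distribution of an embedding table. The matrix sizes and the low-rank construction below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def effective_rank(E: np.ndarray) -> float:
    """RankMe-style effective rank: exp of the entropy of the
    normalized singular-value distribution of embedding matrix E."""
    s = np.linalg.svd(E, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

# Illustrative comparison: a collapsed table concentrates variance
# in a few directions and scores far below its nominal dimension.
rng = np.random.default_rng(0)
healthy = rng.normal(size=(1000, 64))                       # near full rank
collapsed = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 64))
print(effective_rank(healthy))    # close to 64
print(effective_rank(collapsed))  # close to 4
```

Comparing this number before and after adding a DNN branch, at matched capacity, would directly test the mitigation mechanism.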
Load-bearing premise
Observed mitigation stems specifically from improved dimensional robustness of representations rather than general increases in model capacity or unmeasured factors.
What would settle it
An experiment that matches DNN and non-DNN model capacity, tracks embedding dimensionality metrics, and finds no reduction in collapse or performance gain attributable to the DNN.
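
A sketch of the capacity-matching half of that experiment, in PyTorch. The layer sizes are hypothetical; the point is only that the linear control's width can be solved for so that parameter counts agree.

```python
import torch.nn as nn

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Hypothetical sizes: 40 fields x 16-dim embeddings, flattened.
d_in = 640
dnn = nn.Sequential(
    nn.Linear(d_in, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, 1),
)
# Solve (d_in + 2) * w + 1 ~= param_count(dnn) for the control width w,
# since Linear(d_in, w) has (d_in + 1) * w params and Linear(w, 1) has w + 1.
w = round((param_count(dnn) - 1) / (d_in + 2))
linear_control = nn.Sequential(nn.Linear(d_in, w), nn.Linear(w, 1))
print(param_count(dnn), param_count(linear_control))  # ~417k each
```

Both branches would then be trained identically and compared on the effective-rank metric above as well as on accuracy metrics such as AUC or LogLoss.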
Original abstract
DNNs have gained widespread adoption in feature interaction recommendation models. However, there has been a longstanding debate on their roles. On one hand, some works claim that DNNs possess the ability to implicitly capture high-order feature interactions. Conversely, recent studies have highlighted the limitations of DNNs in effectively learning dot products, specifically second-order interactions, let alone higher-order interactions. In this paper, we present a novel perspective to understand the effectiveness of DNNs: their impact on the dimensional robustness of the representations. In particular, we conduct extensive experiments involving both parallel DNNs and stacked DNNs. Our evaluation encompasses an overall study of complete DNN on two feature interaction models, alongside a fine-grained ablation analysis of components within DNNs. Experimental results demonstrate that both parallel and stacked DNNs can effectively mitigate the dimensional collapse of embeddings. Furthermore, a gradient-based theoretical analysis, supported by empirical evidence, uncovers the underlying mechanisms of dimensional collapse.
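
For readers outside the subfield, the "dot products, specifically second-order interactions" the abstract refers to are the pairwise terms of a factorization machine (Rendle, 2010). A minimal sketch, with the embedding shape assumed for illustration:

```python
import torch

def fm_second_order(v: torch.Tensor) -> torch.Tensor:
    """Sum of pairwise dot products <v_i, v_j> over i < j, for one
    sample's active-feature embeddings v of shape (n_fields, d),
    via the standard O(n*d) identity from Rendle (2010)."""
    s = v.sum(dim=0)
    return 0.5 * (s.pow(2).sum() - v.pow(2).sum())
```

The debate the abstract summarizes is whether a DNN fed the same embeddings can recover terms like this implicitly; the paper's answer is that the DNN's benefit lies elsewhere.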
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that DNNs in feature interaction recommendation models mitigate the dimensional collapse of embeddings by enhancing the dimensional robustness of representations. This is demonstrated through extensive experiments on parallel and stacked DNNs in two feature interaction models, including overall studies and fine-grained ablations, along with a gradient-based theoretical analysis supported by empirical evidence.
Significance. If the central claim holds, this work provides a novel perspective on the longstanding debate regarding the role of DNNs in capturing feature interactions, shifting focus from high-order interaction learning to representation robustness. This could have implications for designing more effective recommendation models and understanding embedding behaviors in deep learning.
major comments (2)
- [Experimental Evaluation] The experimental evaluation (overall study and fine-grained ablations) does not include capacity-matched controls. Adding parallel or stacked DNNs necessarily increases parameter count and introduces non-linear transformations, yet no comparisons are made to equivalent-capacity baselines such as wider linear layers or expanded embedding dimensions. Without these, reductions in dimensional collapse cannot be securely attributed to the claimed dimensional-robustness mechanism rather than generic capacity or optimization effects.
- [Gradient-based Theoretical Analysis] The gradient-based theoretical analysis relies on assumptions about gradient flow through the embeddings and DNN components. The manuscript does not verify that these assumptions remain valid when isolating the DNN contribution from overall model capacity changes, leaving the mechanistic explanation vulnerable to the same confound identified in the experiments.
minor comments (2)
- [Abstract] The abstract refers to 'two feature interaction models' without naming them (e.g., FM, DeepFM, or similar); explicit identification would improve readability and allow immediate contextualization of the results.
- [Notation and Definitions] Notation for the dimensional collapse metric and embedding dimensions should be defined more explicitly at first use to avoid ambiguity in the theoretical and experimental sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important considerations for strengthening the experimental and theoretical claims. We address each major comment below and will revise the manuscript to incorporate capacity-controlled experiments and additional validation of the gradient analysis.
Point-by-point responses
Referee: [Experimental Evaluation] The experimental evaluation (overall study and fine-grained ablations) does not include capacity-matched controls. Adding parallel or stacked DNNs necessarily increases parameter count and introduces non-linear transformations, yet no comparisons are made to equivalent-capacity baselines such as wider linear layers or expanded embedding dimensions. Without these, reductions in dimensional collapse cannot be securely attributed to the claimed dimensional-robustness mechanism rather than generic capacity or optimization effects.
Authors: We agree that the absence of capacity-matched controls is a limitation of the current experimental design. While the fine-grained ablations vary DNN components (e.g., depth, width, and activations) within fixed overall architectures, they do not explicitly match total parameter counts against non-DNN alternatives. To address this, the revised manuscript will add baselines that control for capacity, including (i) linear layers with parameter counts matched to the DNN variants and (ii) models with expanded embedding dimensions that achieve equivalent effective capacity without non-linear transformations. These will be evaluated with the same dimensional collapse metrics on the two feature interaction models, allowing clearer attribution to the robustness mechanism. revision: yes
Referee: [Gradient-based Theoretical Analysis] The gradient-based theoretical analysis relies on assumptions about gradient flow through the embeddings and DNN components. The manuscript does not verify that these assumptions remain valid when isolating the DNN contribution from overall model capacity changes, leaving the mechanistic explanation vulnerable to the same confound identified in the experiments.
Authors: The gradient analysis applies the chain rule to the embedding gradients, accounting for the multiplicative effect of the DNN Jacobian, and empirical measurements of gradient norms and directional alignment in our experiments support the derived mechanism. We acknowledge, however, that these assumptions have not been explicitly re-validated under capacity-matched conditions, creating a potential confound. In the revision, we will extend the analysis with additional derivations and simulations that hold total parameter count fixed (comparing DNN paths to linear equivalents) and include corresponding empirical checks on gradient statistics across the same datasets and models. revision: yes
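
A toy illustration of the mechanism the rebuttal describes, assuming nothing about the authors' actual derivation: with a purely linear head, every embedding row receives the same gradient direction (a rank-1 update), while a ReLU DNN's input-dependent Jacobian spreads gradient mass across many directions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
emb = nn.Parameter(torch.randn(256, d))   # toy embedding table

dnn_head = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
linear_head = nn.Linear(d, 1)

def grad_singular_values(head: nn.Module) -> torch.Tensor:
    """Singular values of d(loss)/d(emb): a flatter spectrum means the
    update pushes embeddings along many directions, not a few."""
    if emb.grad is not None:
        emb.grad = None
    head(emb).sum().backward()            # dummy scalar loss
    return torch.linalg.svdvals(emb.grad.detach())

print(grad_singular_values(linear_head)[:4])  # rank-1: one nonzero value
print(grad_singular_values(dnn_head)[:4])     # mass spread across directions
```

The capacity-matched check the authors promise would compare spectra like these while holding total parameter count fixed.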
Circularity Check
No significant circularity detected; claims rest on independent experiments and analysis
full rationale
The paper grounds its central claims in extensive empirical evaluations of parallel and stacked DNNs on feature interaction models, plus a separate gradient-based theoretical analysis of the mechanism. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain. The mitigation of dimensional collapse is presented as an observed outcome supported by ablations and theory, not as a tautological renaming or a prediction forced by the inputs. The derivation chain is validated against external benchmarks rather than closing on itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Embeddings are updated via gradient descent in feature interaction models.