Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation

Chao Zhou; Gengsheng Xue; Guoming Li; Haijie Gu; Jin Chen; Junwei Pan; Menglin Yang; Shangyu Zhang; Shudong Huang; Wentao Ning

arXiv: 2605.23191 · v1 · pith:VVIQCSUPnew · submitted 2026-05-22 · cs.LG · cs.IR · cs.NA · math.NA

Pith reviewed 2026-05-25 05:28 UTC · model grok-4.3

classification cs.LG · cs.IR · cs.NA · math.NA

keywords recommendation systems · embedding collapse · effective rank · token mixing · P-FFN · scaling · RankElastor · GLU

The pith

RankElastor prevents embedding collapse in scaled recommendation models by using parameterized full mixing and GLU-improved P-FFNs to stabilize effective-rank trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that RankMixer architectures suffer from embedding collapse because rigid token mixing and standard per-token feedforward networks create a damped oscillatory trajectory in effective rank across layers, limiting expressivity. RankElastor addresses this with parameterized full mixing that enables more expressive token interactions with better spectral robustness, plus GLU-improved P-FFNs that stabilize representation spectra. Experiments on large industrial datasets show consistent performance gains, reduced collapse, and more robust scaling. A sympathetic reader would care because the changes allow larger models to use their expanded capacity without representations collapsing to low effective rank.
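For readers who want the quantity pinned down: a standard definition of effective rank is the one by Roy and Vetterli (2007). Whether the paper uses exactly this normalization is an assumption here, and the symbols $X$, $\sigma_i$, $p_i$ and $n$ are introduced only for this note. For a representation matrix $X$ with singular values $\sigma_1, \dots, \sigma_n$:

$$
p_i = \frac{\sigma_i}{\sum_{j=1}^{n} \sigma_j}, \qquad \operatorname{erank}(X) = \exp\!\Big(-\sum_{i=1}^{n} p_i \log p_i\Big), \qquad 0 \log 0 := 0 .
$$

The value ranges from 1, when a single singular value carries all the mass, to $n$, when the spectrum is flat. Under this reading, embedding collapse means $\operatorname{erank}(X)$ sits far below $n$.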

Core claim

Through empirical analysis and theoretical insights, the authors show that rigid token mixing and P-FFN modules jointly induce a damped oscillatory trajectory in effective-rank evolution across layers, and introduce RankElastor which produces spectrum-robust representations with provable collapse mitigation through parameterized full mixing and GLU-improved P-FFNs.

What carries the argument

Parameterized full mixing for expressive token mixing with improved spectral robustness, combined with GLU-improved P-FFNs for stabilizing representation spectra, as the two components of the RankElastor architecture.
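As a rough illustration of the second component, the sketch below contrasts a plain per-token FFN with a GLU-style one in NumPy. It assumes that "per-token" means each token position owns its own weights, and that the GLU form follows the gated variants described by Shazeer (2020): an activated gate branch multiplied elementwise by a linear value branch. The SiLU activation, the function names and the tensor shapes are illustrative choices, not details taken from the paper.

```python
import numpy as np


def silu(z):
    # SiLU / Swish activation: z * sigmoid(z). Illustrative choice only.
    return z / (1.0 + np.exp(-z))


def per_token_ffn(x, w_in, w_out):
    """Plain per-token FFN (assumed form).

    x:     (T, D)    one row per token
    w_in:  (T, D, H) separate input projection for every token
    w_out: (T, H, D) separate output projection for every token
    """
    hidden = silu(np.einsum("td,tdh->th", x, w_in))
    return np.einsum("th,thd->td", hidden, w_out)


def per_token_glu_ffn(x, w_gate, w_value, w_out):
    """GLU-style per-token FFN (assumed form).

    The activated gate branch multiplies a linear value branch
    elementwise before the per-token output projection.
    """
    gate = silu(np.einsum("td,tdh->th", x, w_gate))
    value = np.einsum("td,tdh->th", x, w_value)
    return np.einsum("th,thd->td", gate * value, w_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, H = 4, 8, 16  # hypothetical sizes
    x = rng.normal(size=(T, D))
    w_gate = rng.normal(size=(T, D, H))
    w_value = rng.normal(size=(T, D, H))
    w_out = rng.normal(size=(T, H, D))
    print(per_token_glu_ffn(x, w_gate, w_value, w_out).shape)
```

Both variants keep tokens independent of each other; only the mixing step moves information across tokens. The gated version adds one extra projection per token, which is the usual cost of a GLU-style block.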

If this is right

Recommendation models maintain higher effective rank and thus greater expressivity as they scale to larger sizes.
Performance improves consistently on large-scale industrial datasets without underutilizing the representation space.
The architecture exhibits robust scaling behavior by avoiding the collapse induced by damped rank oscillations.
Representation spectra can be shaped through architectural choices to support denser scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modifications to token mixing and FFN modules could be tested in other sequence models where effective-rank collapse appears during scaling.
Tracking the effective-rank trajectory during training could become a practical diagnostic tool for detecting impending collapse in recommendation systems.
Varying the degree of parameterization in full mixing might allow further control over rank dynamics beyond the current design.
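The diagnostic idea in the list above, tracking effective rank across representation stages, can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the Roy and Vetterli (2007) definition of effective rank and assuming that each stage's output for one sample is available as a (tokens, dim) array. `effective_rank` and `effective_rank_trajectory` are hypothetical helper names, not functions from the paper's code release.

```python
import numpy as np


def effective_rank(x, eps=1e-12):
    """Effective rank of a 2-D array: exp of the Shannon entropy of the
    singular values after normalizing them to sum to one."""
    s = np.linalg.svd(x, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero mass so log() stays finite
    return float(np.exp(-(p * np.log(p)).sum()))


def effective_rank_trajectory(stage_outputs):
    """Effective rank at each stage.

    stage_outputs: list of (T, D) arrays, one per representation stage
    (for example raw embeddings, then the output of every layer).
    """
    return [effective_rank(h) for h in stage_outputs]


if __name__ == "__main__":
    # Toy stages: a full-rank matrix followed by a rank-one matrix.
    stages = [np.eye(4), np.outer(np.ones(4), np.ones(4))]
    print(effective_rank_trajectory(stages))
```

On the toy input the identity matrix has effective rank 4 and the rank-one matrix has effective rank 1, so a drop of this kind between consecutive stages is the pattern such a monitor would flag.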

Load-bearing premise

Rigid token mixing and standard P-FFN modules are the primary causes of the damped oscillatory trajectory in effective-rank evolution that leads to embedding collapse.

What would settle it

Training RankElastor and RankMixer on the same large-scale industrial datasets and finding no measurable increase in effective rank or recommendation performance for RankElastor would disprove the mitigation claim.

Figures

Figures reproduced from arXiv: 2605.23191 by Chao Zhou, Gengsheng Xue, Guoming Li, Haijie Gu, Jin Chen, Junwei Pan, Menglin Yang, Shangyu Zhang, Shudong Huang, Wentao Ning.

**Figure 1.** Distributional shift of effective rank across representation stages in RankMixer. Distributions of per-sample effective ...

**Figure 2.** Comparison of average effective rank across repre...

**Figure 3.** Architectural overview of (a) RankMixer and (b) RankElastor.

**Figure 4.** Efficiency comparison of RankElastor and ...

**Figure 5.** Distributional shift of per-sample effective rank across representation stages in RankElastor, from raw embeddings ...

**Figure 6.** Layer-wise comparison of average effective rank, ...

**Figure 7.** Dense parameter scaling trends under width (left) and depth (right) scaling for RankMixer and RankElastor. The ...

**Figure 8.** Dense parameter scaling trends under joint width ...

Original abstract

Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from *embedding collapse*, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a **damped oscillatory trajectory** in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) **parameterized full mixing**, which enables expressive token mixing with improved spectral robustness; and (ii) **GLU-improved P-FFNs**, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: https://github.com/vasile-paskardlgm/RankElastor
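To make the token-mixing step in the abstract more concrete, here is a hedged NumPy sketch of two mixing styles. The first, `head_shuffle_mixing`, is an assumed reading of "rigid" mixing as a fixed, parameter-free reshuffle: each token is split into heads, and head h of every token is concatenated into new token h. The second, `learned_full_mixing`, is a purely hypothetical stand-in for a parameterized alternative, here a learned (T, T) matrix applied along the token axis. Neither function is taken from the paper or its repository, and the actual RankElastor operator may differ.

```python
import numpy as np


def head_shuffle_mixing(x, num_heads):
    """Parameter-free mixing (assumed form).

    x: (T, D). Each token is cut into num_heads chunks of size D // num_heads;
    output token h is the concatenation of chunk h from every input token.
    With num_heads == T the output keeps the (T, D) shape.
    """
    num_tokens, dim = x.shape
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = x.reshape(num_tokens, num_heads, head_dim)
    return heads.transpose(1, 0, 2).reshape(num_heads, num_tokens * head_dim)


def learned_full_mixing(x, mix):
    """Hypothetical parameterized mixing: a learned (T, T) matrix `mix`
    forms every output token as a weighted sum of all input tokens."""
    return mix @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))  # hypothetical: 4 tokens of width 8
    shuffled = head_shuffle_mixing(x, num_heads=4)
    mixed = learned_full_mixing(x, np.eye(4) + 0.1 * rng.normal(size=(4, 4)))
    print(shuffled.shape, mixed.shape)
```

The reshuffle only permutes entries, so it has nothing to learn and cannot adapt to the data, whereas the matrix version exposes T × T trainable weights. That contrast is the intuition this sketch is meant to convey; it does not reproduce the paper's analysis.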

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that RankMixer suffers from embedding collapse due to rigid token mixing and standard P-FFN modules inducing a damped oscillatory trajectory in effective-rank evolution. It proposes RankElastor, which uses parameterized full mixing for expressive token interactions and GLU-improved P-FFNs to stabilize spectra, yielding spectrum-robust representations with provable collapse mitigation. Extensive experiments on large-scale industrial datasets show consistent gains in recommendation performance, reduced collapse, and improved scaling behavior, with code released.

Significance. If the empirical gains and theoretical mitigation hold, the work addresses a practical barrier to dense scaling in recommendation systems by improving representation expressivity without collapse. The release of code is a strength for reproducibility in the field.

minor comments (3)

Abstract: the phrase 'provable collapse mitigation' would be strengthened by a parenthetical reference to the specific theorem or analysis section establishing the guarantee.
The transition from the identified damped oscillatory trajectory to the design of the two new modules would benefit from an explicit mapping (e.g., which component addresses which part of the oscillation).
Experimental section: while industrial datasets are mentioned, a brief statement on scale (number of users/items, training steps) would aid readers in assessing the robustness of the scaling claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, recognition of its significance for dense scaling in recommendation systems, and recommendation of minor revision. We appreciate the acknowledgment of the code release for reproducibility.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper identifies embedding collapse via empirical analysis of effective-rank trajectories under rigid mixing and standard P-FFNs, then introduces parameterized full mixing and GLU-improved P-FFNs as architectural remedies. No load-bearing step reduces the claimed mitigation to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain; the theoretical insights are presented as independent analysis rather than derived from the new modules themselves. Experiments on external industrial datasets supply falsifiable validation outside the model's fitted values, keeping the central claim self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

[2] David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017. The Shattered Gradients Problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea G...

[3] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. 2020. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences 117, 48 (2020), 30063–30070. arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.1907378117 doi:10.1073/pnas.1907378117

[4] Don Batory, Peter Höfner, and Jongwook Kim. 2011. Feature interactions, products, and composition. In Proceedings of the 10th ACM International Conference on Generative Programming and Component Engineering (Portland, Oregon, USA) (GPCE ’11). Association for Computing Machinery, New York, NY, USA, 13–22. doi:10.1145/2047862.2047867

[5] Huiyuan Chen, Vivian Lai, Hongye Jin, Zhimeng Jiang, Mahashweta Das, and Xia Hu. 2024. Towards Mitigating Dimensional Collapse of Representations in Collaborative Filtering. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (Merida, Mexico) (WSDM ’24). Association for Computing Machinery, New York, NY, USA, 106–115. doi:...

[6] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. 2016. Optimal approximate matrix product in terms of stable rank. arXiv:1507.02268 [cs.DS] https://arxiv.org/abs/1507.02268

[7] Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. 2025. Scaling FP8 training to trillion-token LLMs. arXiv:2409.12517 [cs.LG] https://arxiv.org/abs/2409.12517

[8] Wei Guo, Hao Wang, Luankang Zhang, Jin Yao Chin, Zhongzhou Liu, Kai Cheng, Qiushi Pan, Yi Quan Lee, Wanqi Xue, Tingjia Shen, Kenan Song, Kefan Wang, Wenjia Xie, Yuyang Ye, Huifeng Guo, Yong Liu, Defu Lian, Ruiming Tang, and Enhong Chen. 2024. Scaling New Frontiers: Insights into Large Recommendation Models. arXiv:2412.00714 [cs.IR] https://arxiv.org/abs/2...

[9] Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2024. On the Embedding Collapse when Scaling up Recommendation Models. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=aPVwOAr1aW

[10] Harold V. Henderson and S. R. Searle. 1981. The vec-permutation matrix, the vec operator and Kronecker products: a review. Linear and Multilinear Algebra 9, 4 (1981), 271–288. arXiv:https://doi.org/10.1080/03081088108817379 doi:10.1080/03081088108817379

[11] Dan Hendrycks and Kevin Gimpel. 2023. Gaussian Error Linear Units (GELUs). arXiv:1606.08415 [cs.LG] https://arxiv.org/abs/1606.08415

[12] Roger A Horn and Charles R Johnson. 2012. Matrix analysis. Cambridge University Press

[13] Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. 2021. On Feature Decorrelation in Self-Supervised Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 9598–9608

[14] Olivier Chapelle Jean-Baptiste Tien, joycenv. 2014. Display Advertising Challenge. https://kaggle.com/competitions/criteo-display-ad-challenge

[15] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. 2022. Understanding Dimensional Collapse in Contrastive Self-supervised Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=YevsQ05DEN7

[16] Yu Kang, Junwei Pan, Jipeng Jin, Shudong Huang, Xiaofeng Gao, and Lei Xiao

[17] Towards Unifying Feature Interaction Models for Click-Through Rate Prediction. In Machine Learning and Knowledge Discovery in Databases. Research Track, Rita P. Ribeiro, Bernhard Pfahringer, Nathalie Japkowicz, Pedro Larrañaga, Alípio M. Jorge, Carlos Soares, Pedro H. Abreu, and João Gama (Eds.). Springer Nature Switzerland, Cham, 451–467

[18] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG] https://arxiv.org/abs/2001.08361

[19] Kenji Kira and Larry A. Rendell. 1992. The feature selection problem: traditional methods and a new algorithm. In Proceedings of the Tenth National Conference on Artificial Intelligence (San Jose, California) (AAAI’92). AAAI Press, 129–134

[20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. 2023. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4015–4026

[21] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New Y...

[22] Zhutian Lin, Junwei Pan, Haibin Yu, Xi Xiao, Ximei Wang, Zhixiang Feng, Shifeng Wen, Shudong Huang, Dapeng Liu, and Lei Xiao. 2025. Crocodile: Cross Experts Covariance for Disentangled Learning in Multi-Domain Recommendation. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (Seoul, Republic of Korea) (CIKM ’25...

[23] Jan R. Magnus and H. Neudecker. 1979. The Commutation Matrix: Some Properties and Applications. The Annals of Statistics 7, 2 (1979), 381–394. doi:10.1214/aos/1176344621

[24] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conferen...

[25] Junwei Pan, Wei Xue, Ximei Wang, Haibin Yu, Xun Liu, Shijie Quan, Xueming Qiu, Dapeng Liu, Lei Xiao, and Jie Jiang. 2024. Ads Recommendation in a Collapsed and Entangled World. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Barcelona, Spain) (KDD ’24). Association for Computing Machinery, New York, NY, USA, 5566–5577...

[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machi...

[27] Steffen Rendle. 2010. Factorization Machines. In 2010 IEEE International Conference on Data Mining. 995–1000. doi:10.1109/ICDM.2010.127

[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695

[29] Olivier Roy and Martin Vetterli. 2007. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference. 606–610

[30] Noam Shazeer. 2020. GLU Variants Improve Transformer. arXiv:2002.05202 [cs.LG] https://arxiv.org/abs/2002.05202

[31] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA...

[32] Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen. 2024. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters. arXiv:2406.05955 [cs.LG] https://arxiv.org/abs/2406.05955

[33] Will Cukierski Steve Wang. 2014. Click-Through Rate Prediction. https://kaggle.com/competitions/avazu-ctr-prediction

[34] Gilbert Strang. 2022. Introduction to linear algebra. SIAM

[35] Roman Vershynin. 2011. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027 [math.PR] https://arxiv.org/abs/1011.3027

[36] Chunqi Wang, Bingchao Wu, Zheng Chen, Lei Shen, Bing Wang, and Xiaoyi Zeng

[37] Scaling Transformers for Discriminative Recommendation via Generative Pretraining. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (Toronto ON, Canada) (KDD ’25). Association for Computing Machinery, New York, NY, USA, 2893–2903. doi:10.1145/3711896.3737117

[38] Jiancheng Wang, Mingjia Yin, Hao Wang, and Enhong Chen. 2025. Enhancing CTR Prediction with De-correlated Expert Networks. arXiv:2505.17925 [cs.IR] https://arxiv.org/abs/2505.17925

[39] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery, New York, NY, USA, 1785–1797. doi:10.1145/3442381.3450078

[40] Jaewoo Yang, Hayun Kim, and Younghoon Kim. 2024. Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs. arXiv:2405.14428 [cs.CL] https://arxiv.org/abs/2405.14428

[41] Mingjia Yin, Junwei Pan, Hao Wang, Ximei Wang, Shangyu Zhang, Jie Jiang, Defu Lian, and Enhong Chen. 2025. From Feature Interaction to Feature Generation: A Generative Paradigm of CTR Prediction Models. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=DatAXrGzlc

[42] Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Shen Li, Yanli Zhao, Yuchen Hao, Yantao Yao, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, and Wenlin Chen. 2024. Wukong: Towards a Scaling Law for Large-Scale Recommendation. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=8iUgr2nuwo

[43] Gaowei Zhang, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Scaling Law of Large Sequential Recommendation Models. In Proceedings of the 18th ACM Conference on Recommender Systems (Bari, Italy) (RecSys ’24). Association for Computing Machinery, New York, NY, USA, 444–453. ...

[44] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi-field Categorical Data. In Advances in Information Retrieval, Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello (Eds.). Springer International Publishing, Cham, 45–57

[45] Yabin Zhang, Jiakai Tang, and Xu Chen. 2025. Alleviating Dimensional Collapse Problem in Deep Recommender Models by Designing Uniformity Layers. In Database Systems for Advanced Applications, Makoto Onizuka, Jae-Gil Lee, Yongxin Tong, Chuan Xiao, Yoshiharu Ishikawa, Sihem Amer-Yahia, H. V. Jagadish, and Kejing Lu (Eds.). Springer Nature Singapore, Singap...

[46] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New York...

[47] Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, Huizhi Yang, Zheng Chai, Zhe Chen, Yuchao Zheng, Qiwei Chen, Feng Zhang, Xun Zhou, Peng Xu, Xiao Yang, Di Wu, and Zuotao Liu. 2025. RankMixer: Scaling Up Ranking Models in Industrial Recommenders. In Proceedings of the 34th ACM Inter...

[48] Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. 2021. Open Benchmarking for Click-Through Rate Prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 2759–2769. doi:10.1145/3459637.3482...

[1] [1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017. The Shattered Gradients Problem: If resnets are the answer, then what is the question?. InProceedings of the 34th Inter- national Conference on Machine Learning (Proceedings of Machine Learning KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea G...

work page 2017

[3] [3]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. 2020. Benign overfitting in linear regression.Proceed- ings of the National Academy of Sciences117, 48 (2020), 30063– 30070. arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.1907378117 doi:10.1073/pnas.1907378117

work page doi:10.1073/pnas.1907378117 2020

[4] [4]

Don Batory, Peter Höfner, and Jongwook Kim. 2011. Feature interactions, prod- ucts, and composition. InProceedings of the 10th ACM International Conference on Generative Programming and Component Engineering(Portland, Oregon, USA) (GPCE ’11). Association for Computing Machinery, New York, NY, USA, 13–22. doi:10.1145/2047862.2047867

work page doi:10.1145/2047862.2047867 2011

[5] [5]

Huiyuan Chen, Vivian Lai, Hongye Jin, Zhimeng Jiang, Mahashweta Das, and Xia Hu. 2024. Towards Mitigating Dimensional Collapse of Representations in Collab- orative Filtering. InProceedings of the 17th ACM International Conference on Web Search and Data Mining(Merida, Mexico)(WSDM ’24). Association for Computing Machinery, New York, NY, USA, 106–115. doi:...

work page doi:10.1145/3616855.3635832 2024

[6] [6]

Optimal approximate matrix product in terms of stable rank

Michael B. Cohen, Jelani Nelson, and David P. Woodruff. 2016. Optimal ap- proximate matrix product in terms of stable rank. arXiv:1507.02268 [cs.DS] https://arxiv.org/abs/1507.02268

work page internal anchor Pith review Pith/arXiv arXiv 2016

[7] [7]

Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. 2025. Scaling FP8 training to trillion-token LLMs. arXiv:2409.12517 [cs.LG] https://arxiv.org/ abs/2409.12517

work page arXiv 2025

[8] [8]

Wei Guo, Hao Wang, Luankang Zhang, Jin Yao Chin, Zhongzhou Liu, Kai Cheng, Qiushi Pan, Yi Quan Lee, Wanqi Xue, Tingjia Shen, Kenan Song, Kefan Wang, Wenjia Xie, Yuyang Ye, Huifeng Guo, Yong Liu, Defu Lian, Ruiming Tang, and Enhong Chen. 2024. Scaling New Frontiers: Insights into Large Recommendation Models. arXiv:2412.00714 [cs.IR] https://arxiv.org/abs/2...

work page arXiv 2024

[9] [9]

Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2024. On the Embedding Collapse when Scaling up Recommendation Models. InForty-first International Conference on Machine Learning. https:// openreview.net/forum?id=aPVwOAr1aW

work page 2024

[10] [10]

Henderson and S

Harold V. Henderson and S. R. Searle. 1981. The vec-permutation matrix, the vec operator and Kronecker products: a review.Linear and Multilinear Algebra9, 4 (1981), 271–288. arXiv:https://doi.org/10.1080/03081088108817379 doi:10.1080/ 03081088108817379

work page doi:10.1080/03081088108817379 1981

[11] [11]

Dan Hendrycks and Kevin Gimpel. 2023. Gaussian Error Linear Units (GELUs). arXiv:1606.08415 [cs.LG] https://arxiv.org/abs/1606.08415

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

2012.Matrix analysis

Roger A Horn and Charles R Johnson. 2012.Matrix analysis. Cambridge university press

work page 2012

[13] [13]

Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. 2021. On Feature Decorrelation in Self-Supervised Learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 9598–9608

work page 2021

[14] [14]

Olivier Chapelle Jean-Baptiste Tien, joycenv. 2014. Display Advertising Challenge. https://kaggle.com/competitions/criteo-display-ad-challenge

work page 2014

[15] [15]

Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. 2022. Understand- ing Dimensional Collapse in Contrastive Self-supervised Learning. InInterna- tional Conference on Learning Representations. https://openreview.net/forum?id= YevsQ05DEN7

work page 2022

[16] [16]

Yu Kang, Junwei Pan, Jipeng Jin, Shudong Huang, Xiaofeng Gao, and Lei Xiao

work page

[17] [17]

InMachine Learning and Knowledge Discovery in Databases

Towards Unifying Feature Interaction Models for Click-Through Rate Prediction. InMachine Learning and Knowledge Discovery in Databases. Research Track, Rita P. Ribeiro, Bernhard Pfahringer, Nathalie Japkowicz, Pedro Larrañaga, Alípio M. Jorge, Carlos Soares, Pedro H. Abreu, and João Gama (Eds.). Springer Nature Switzerland, Cham, 451–467

work page

[18] [18]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG] https: //arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020

[19] [19]

Kenji Kira and Larry A. Rendell. 1992. The feature selection problem: traditional methods and a new algorithm. InProceedings of the Tenth National Conference on Artificial Intelligence(San Jose, California)(AAAI’92). AAAI Press, 129–134

work page 1992

[20] [20]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. 2023. Segment Anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4015–4026

work page 2023

[21] [21]

Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(London, United Kingdom)(KDD ’18). Association for Computing Machinery, New Y...

work page doi:10.1145/3219819.3220023 2018

[22] [22]

Zhutian Lin, Junwei Pan, Haibin Yu, Xi Xiao, Ximei Wang, Zhixiang Feng, Shifeng Wen, Shudong Huang, Dapeng Liu, and Lei Xiao. 2025. Crocodile: Cross Experts Covariance for Disentangled Learning in Multi-Domain Recommendation. InPro- ceedings of the 34th ACM International Conference on Information and Knowledge Management(Seoul, Republic of Korea)(CIKM ’25...

work page doi:10.1145/3746252.3761332 2025

[23] [23]

Statist.] 10.1214/aos/1176344136 , 6, 461

Jan R. Magnus and H. Neudecker. 1979. The Commutation Matrix: Some Properties and Applications.The Annals of Statistics7, 2 (1979), 381 – 394. doi:10.1214/aos/1176344621

work page doi:10.1214/aos/1176344621 1979

[24] [24]

Brendan McMahan, Gary Holt, D

H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conferen...

[25]

Junwei Pan, Wei Xue, Ximei Wang, Haibin Yu, Xun Liu, Shijie Quan, Xueming Qiu, Dapeng Liu, Lei Xiao, and Jie Jiang. 2024. Ads Recommendation in a Collapsed and Entangled World. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Barcelona, Spain) (KDD ’24). Association for Computing Machinery, New York, NY, USA, 5566–5577...

[26]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machi...

[27]

Steffen Rendle. 2010. Factorization Machines. In 2010 IEEE International Conference on Data Mining. 995–1000. doi:10.1109/ICDM.2010.127

[28]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695

[29]

Olivier Roy and Martin Vetterli. 2007. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference. 606–610.

[30]

Noam Shazeer. 2020. GLU Variants Improve Transformer. arXiv:2002.05202 [cs.LG] https://arxiv.org/abs/2002.05202

[31]

Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA...

[32]

Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen. 2024. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters. arXiv:2406.05955 [cs.LG] https://arxiv.org/abs/2406.05955

[33]

Steve Wang and Will Cukierski. 2014. Click-Through Rate Prediction. https://kaggle.com/competitions/avazu-ctr-prediction

[34]

Gilbert Strang. 2022. Introduction to linear algebra. SIAM.

[35]

Roman Vershynin. 2011. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027 [math.PR] https://arxiv.org/abs/1011.3027

[36]

Chunqi Wang, Bingchao Wu, Zheng Chen, Lei Shen, Bing Wang, and Xiaoyi Zeng. 2025. Scaling Transformers for Discriminative Recommendation via Generative Pretraining. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (Toronto ON, Canada) (KDD ’25). Association for Computing Machinery, New York, NY, USA, 2893–2903. doi:10.1145/3711896.3737117

[38]

Jiancheng Wang, Mingjia Yin, Hao Wang, and Enhong Chen. 2025. Enhancing CTR Prediction with De-correlated Expert Networks. arXiv:2505.17925 [cs.IR] https://arxiv.org/abs/2505.17925

[39]

Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery, New York, NY, USA, 1785–1797. doi:10.1145/3442381.3450078

[40]

Jaewoo Yang, Hayun Kim, and Younghoon Kim. 2024. Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs. arXiv:2405.14428 [cs.CL] https://arxiv.org/abs/2405.14428

[41]

Mingjia Yin, Junwei Pan, Hao Wang, Ximei Wang, Shangyu Zhang, Jie Jiang, Defu Lian, and Enhong Chen. 2025. From Feature Interaction to Feature Generation: A Generative Paradigm of CTR Prediction Models. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=DatAXrGzlc

[42]

Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Shen Li, Yanli Zhao, Yuchen Hao, Yantao Yao, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, and Wenlin Chen. 2024. Wukong: Towards a Scaling Law for Large-Scale Recommendation. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=8iUgr2nuwo

[43]

Gaowei Zhang, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Scaling Law of Large Sequential Recommendation Models. In Proceedings of the 18th ACM Conference on Recommender Systems (Bari, Italy) (RecSys ’24). Association for Computing Machinery, New York, NY, USA, 444–453. doi:10.1145/3640457.3688129

[44]

Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi-field Categorical Data. In Advances in Information Retrieval, Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello (Eds.). Springer International Publishing, Cham, 45–57.

[45]

Yabin Zhang, Jiakai Tang, and Xu Chen. 2025. Alleviating Dimensional Collapse Problem in Deep Recommender Models by Designing Uniformity Layers. In Database Systems for Advanced Applications, Makoto Onizuka, Jae-Gil Lee, Yongxin Tong, Chuan Xiao, Yoshiharu Ishikawa, Sihem Amer-Yahia, H. V. Jagadish, and Kejing Lu (Eds.). Springer Nature Singapore, Singap...

[46]

Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD ’18). Association for Computing Machinery, New York...

[47]

Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, Huizhi Yang, Zheng Chai, Zhe Chen, Yuchao Zheng, Qiwei Chen, Feng Zhang, Xun Zhou, Peng Xu, Xiao Yang, Di Wu, and Zuotao Liu. 2025. RankMixer: Scaling Up Ranking Models in Industrial Recommenders. In Proceedings of the 34th ACM Inter...

[48]

Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. 2021. Open Benchmarking for Click-Through Rate Prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 2759–2769. doi:10.1145/3459637.3482...
