pith. machine review for the scientific record.

arxiv: 2604.13737 · v1 · submitted 2026-04-15 · 💻 cs.IR · cs.AI

Recognition: unknown

TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:45 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords unified recommendation · multi-field features · sequential modeling · attention mechanism · dimensional collapse · feature interaction · user behavior sequences · representation robustness

The pith

TokenFormer unifies multi-field feature interactions and sequential user behavior modeling in one architecture by blocking dimensional collapse of sequence features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recommender systems have long split into two separate lines of work: one that models interactions among many categorical fields and another that tracks sequences of user actions over time. When researchers try to put both into the same model, the sequence features tend to lose their distinct dimensional structure, a failure the paper names Sequential Collapse Propagation. TokenFormer counters this with two targeted changes to the network: a layered attention pattern that starts with full attention and then switches to shrinking sliding windows, plus a non-linear multiplicative step applied to hidden states. These changes let a single model handle both kinds of input while keeping the sequence information intact and more distinguishable. If the approach holds, recommendation systems could stop maintaining two separate modeling traditions and instead use one backbone that works for both feature tables and behavior histories.
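The unified-input idea can be made concrete with a toy sketch. Assuming, as the paper's Figure 2 describes, that multi-field features F, behavior tokens T, and target features V are embedded to a shared width and concatenated into one token stream, the layout might look like the following; all names and dimensions here are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared embedding width (illustrative)

# Multi-field categorical features F (e.g. user country, device, ad category),
# sequential behavior tokens T (embedded past clicks), and target features V:
# each becomes a row in one token matrix, so a single stack of attention
# blocks can model both "worlds" at once.
field_tokens    = rng.normal(size=(3, d))   # F: 3 static fields
behavior_tokens = rng.normal(size=(5, d))   # T: 5 past behaviors, time-ordered
target_tokens   = rng.normal(size=(1, d))   # V: the candidate item

stream = np.concatenate([field_tokens, behavior_tokens, target_tokens], axis=0)
print(stream.shape)  # (9, 8): one unified sequence for the shared backbone
```

The point of the sketch is only the data layout: once everything is one sequence, the static fields and the behavior history interact through the same attention weights, which is exactly the setting in which the paper says sequence features can collapse.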

Core claim

The paper proposes TokenFormer, a unified recommendation architecture that overcomes Sequential Collapse Propagation through a Bottom-Full-Top-Sliding attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers, together with Non-Linear Interaction Representation that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent's advertising platform show state-of-the-art performance, while analysis confirms improved dimensional robustness and representation discriminability under unified modeling.
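To visualize the attention schedule, here is a minimal numpy sketch of what a Bottom-Full-Top-Sliding mask could look like: full causal attention in the lower layers, then sliding windows that shrink with depth. The window sizes and layer counts below are invented for illustration; the paper's exact schedule may differ.

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean mask: True where query position i may attend to key j.

    window=None gives full causal attention; otherwise each position
    sees only the `window` most recent positions (itself included).
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                 # causal, per the Figure 2 description
    if window is None:
        return causal
    return causal & (j > i - window)

seq_len = 8
# Hypothetical BFTS schedule: two full layers, then windows shrinking 4 -> 2.
schedule = [None, None, 4, 2]
masks = [attention_mask(seq_len, w) for w in schedule]

for w, m in zip(schedule, masks):
    print(w, int(m.sum()))  # shrinking windows permit fewer attention pairs
```

The shrinking windows mean upper layers can no longer repeatedly revisit the static fields, which is the mechanism the paper credits for keeping sequence features from collapsing.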

What carries the argument

Bottom-Full-Top-Sliding (BFTS) attention, which runs full self-attention at lower layers and shrinking sliding-window attention at upper layers, combined with Non-Linear Interaction Representation (NLIR) that performs one-sided non-linear multiplicative transformations on hidden states.
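The abstract does not specify the exact functional form of NLIR, so the following is only one plausible reading: a multiplicative interaction in which a non-linearity is applied to a single branch before an element-wise product with the untouched hidden state. The gate choice (SiLU) and all shapes here are assumptions, not the paper's definition.

```python
import numpy as np

def silu(x):
    # SiLU/swish non-linearity; one common gating choice, assumed here.
    return x / (1.0 + np.exp(-x))

def one_sided_multiplicative(h, w_gate):
    """One-sided non-linear multiplicative transform (illustrative).

    Only the gate branch passes through a non-linearity; the other
    factor stays linear in h, hence 'one-sided'.
    """
    gate = silu(h @ w_gate)   # non-linear branch
    return h * gate           # element-wise multiplicative interaction

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 8))             # 5 tokens, hidden width 8
w_gate = rng.normal(size=(8, 8)) * 0.1  # hypothetical gate projection
out = one_sided_multiplicative(h, w_gate)
print(out.shape)  # (5, 8)
```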

Load-bearing premise

That the BFTS attention pattern and NLIR transformations are the direct cause of avoiding sequence collapse and improving robustness, rather than differences in model capacity, training procedure, or evaluation choices.

What would settle it

An ablation on the same benchmarks that replaces BFTS with standard full attention and NLIR with linear interactions, yet still shows equivalent or better performance with no collapse, would falsify the claim that these two components are required for successful unification.
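Any such test needs a concrete measure of collapse. A standard diagnostic, and the one the paper's effective-rank figures (Figures 5, 9, 10) appeal to, is the effective rank of the representation matrix: the exponential of the entropy of its normalized singular-value spectrum. A minimal implementation, assuming rows are tokens and columns are hidden dimensions:

```python
import numpy as np

def effective_rank(H):
    """Effective rank = exp(entropy of the normalized singular values).

    A matrix whose variance concentrates in a few directions (collapse)
    scores near 1; a well-spread representation scores near min(H.shape).
    """
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(100, 16))           # full-dimensional features
collapsed = np.outer(rng.normal(size=100),     # rank-1: all tokens aligned
                     rng.normal(size=16))
print(effective_rank(healthy) > effective_rank(collapsed))  # True
```

Tracking this quantity for the sequence-token block of hidden states, layer by layer, is what would distinguish genuine collapse avoidance from a plain accuracy gain.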

Figures

Figures reproduced from arXiv: 2604.13737 by Chengguo Yin, Haijie Gu, Hanyong Li, Jie Jiang, Junwei Pan, Kaihui Wu, Shangyu Zhang, Shudong Huang, Yifeng Zhou, Yuehong Hu, Zhangbin Zhu, Zhixiang Feng.

Figure 1: The central tradeoff in unified recommendation.
Figure 2: Overview of TokenFormer. TokenFormer represents multi-field features F, sequential behavior tokens T, and target features V as a unified token stream, which is processed by stacked Unified Interaction Blocks (UIBs). Each UIB combines the proposed Bottom-Full-Top-Sliding (BFTS) attention design, which applies full causal attention in shallow layers and shrinking SWA in deeper layers, with the Non-Linear Int…
Figure 3: Discriminability analysis across varying cluster…
Figure 5: Effective rank comparison of sequential behavioral…
Figure 6: Evolution of attention patterns. Top: Vanilla Transformer suffers from redundant revisiting of static fields in last layers…
Figure 7: Left: Attention receptive field distributions. His…
Figure 8: Efficiency and effectiveness trade-offs of various…
Figure 9: Block-wise effective-rank trajectory of sequential…
Figure 10: Per-layer normalized singular value spectra. Com…
read the original abstract

Recommender systems have historically developed along two largely independent paradigms: feature interaction models for modeling correlations among multi-field categorical features, and sequential models for capturing user behavior dynamics from historical interaction sequences. Although recent trends attempt to bridge these paradigms within shared backbones, we empirically reveal that naively unifying these two branches may lead to a failure mode of Sequential Collapse Propagation (SCP). That is, the interaction with those dimensionally ill non-sequence fields leads to the dimensional collapse of the sequence features. To overcome this challenge, we propose TokenFormer, a unified recommendation architecture with the following innovations. First, we introduce a Bottom-Full-Top-Sliding (BFTS) attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers. Second, we introduce a Non-Linear Interaction Representation (NLIR) that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent's advertising platform demonstrate state-of-the-art performance, while detailed analysis confirms that TokenFormer significantly improves dimensional robustness and representation discriminability under unified modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a failure mode termed Sequential Collapse Propagation (SCP) when naively unifying multi-field feature-interaction models with sequential recommendation models, in which interactions with dimensionally ill non-sequence fields cause collapse of sequence features. It proposes TokenFormer, a unified architecture that applies a Bottom-Full-Top-Sliding (BFTS) attention scheme (full self-attention in lower layers, shrinking-window sliding attention in upper layers) together with Non-Linear Interaction Representation (NLIR) via one-sided non-linear multiplicative transformations on hidden states. The paper reports state-of-the-art results on public benchmarks and Tencent advertising data, together with improved dimensional robustness and representation discriminability under unified modeling.

Significance. If the empirical claims are substantiated by properly controlled experiments, the work would offer a practical bridge between two historically separate recommendation paradigms and a concrete mechanism for preserving sequence-feature dimensionality. The emphasis on dimensional robustness under unification is a potentially valuable contribution, but its significance hinges on whether the reported gains are causally attributable to BFTS and NLIR rather than unmatched capacity, training schedules, or evaluation choices.

major comments (2)
  1. [Abstract] Abstract: the central claim that naive unification produces Sequential Collapse Propagation is asserted without any formal definition, equations, or illustrative derivation; this absence makes it impossible to verify whether the proposed BFTS and NLIR mechanisms are necessary or sufficient to address the stated problem.
  2. [Experiments] Experiments (implied by abstract claims): no information is supplied on baseline capacity matching, hyper-parameter schedules, data preprocessing, or ablation studies that isolate the contribution of lower-layer full attention versus upper-layer sliding windows versus the NLIR non-linearity; without these controls the attribution of SOTA performance and robustness gains to the two innovations remains unverified.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'one-sided non-linear multiplicative transformations' is introduced without a mathematical specification or reference to the exact functional form.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on clarity and experimental controls. We address each point below and will revise the manuscript to strengthen verifiability of the SCP claim and attribution of results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that naive unification produces Sequential Collapse Propagation is asserted without any formal definition, equations, or illustrative derivation; this absence makes it impossible to verify whether the proposed BFTS and NLIR mechanisms are necessary or sufficient to address the stated problem.

    Authors: We agree that the abstract is too concise to stand alone on this point. The main text (Section 3.1) formally defines SCP as the propagation of dimensionality mismatch from non-sequence fields through shared attention, leading to sequence feature collapse (with the condition ||h_seq|| -> 0 derived from the attention update rule in Eq. (3)-(4)). To make the abstract self-contained, we will add a one-sentence formal characterization of SCP and note that BFTS/NLIR are designed to mitigate it. revision: yes

  2. Referee: [Experiments] Experiments (implied by abstract claims): no information is supplied on baseline capacity matching, hyper-parameter schedules, data preprocessing, or ablation studies that isolate the contribution of lower-layer full attention versus upper-layer sliding windows versus the NLIR non-linearity; without these controls the attribution of SOTA performance and robustness gains to the two innovations remains unverified.

    Authors: The manuscript reports capacity-matched baselines (parameter counts within 5% in Table 1), standard grid-search hyperparameter tuning on validation sets (Appendix C), and preprocessing details (Section 4.1). Section 4.3 already contains ablations that isolate BFTS layers (full vs. sliding) and NLIR (with/without the one-sided non-linearity). However, to improve transparency we will expand the experimental section with an explicit controls subsection, additional tables on matched capacities, and finer-grained ablations separating the lower-layer full attention from the upper-layer sliding windows. revision: partial

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical validation, no derivations or self-referential reductions

full rationale

The paper claims an empirical observation of Sequential Collapse Propagation under naive unification of multi-field and sequential models, then introduces BFTS attention (full self-attention in lower layers, shrinking-window in upper) and NLIR (one-sided non-linear multiplicative transforms) as architectural remedies, validated by SOTA results on public benchmarks and Tencent data. No mathematical derivation chain, equations, or first-principles results appear in the abstract or described claims. Performance and robustness improvements are presented as experimental outcomes rather than predictions derived from fitted inputs or self-citations. No load-bearing steps reduce by construction to the inputs; the work is self-contained as an empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claim rests on the empirical observation of Sequential Collapse Propagation and on the effectiveness of two newly introduced components (BFTS and NLIR) whose independent grounding is not supplied in the abstract. No free parameters or background axioms are stated.

invented entities (3)
  • Sequential Collapse Propagation (SCP) no independent evidence
    purpose: Names the dimensional collapse failure mode that occurs when non-sequence fields interact with sequence features
    Introduced as an empirical finding of the paper; no independent verification or prior citation is mentioned in the abstract.
  • Bottom-Full-Top-Sliding (BFTS) attention scheme no independent evidence
    purpose: Combines full self-attention in lower layers with shrinking-window sliding attention in upper layers
    Newly proposed architectural pattern; no prior reference or independent evidence supplied in the abstract.
  • Non-Linear Interaction Representation (NLIR) no independent evidence
    purpose: Applies one-sided non-linear multiplicative transformations to hidden states
    Newly proposed transformation; no prior reference or independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5528 in / 1533 out tokens · 69544 ms · 2026-05-10T12:45:02.337214+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150 (2020)
  2. [2] Zheng Chai, Qin Ren, Xijun Xiao, Huizhi Yang, Bo Han, Sijun Zhang, Di Chen, Hui Lu, Wenlin Zhao, Lele Yu, et al. 2025. LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders. arXiv:2505.04421 [cs.IR]
  3. [3] Jianxin Chang, Chenbin Zhang, Zhiyi Fu, Xiaoxue Zang, Lin Guan, Jing Lu, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, et al. 2023. TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 3784–3794
  4. [4] Jianxin Chang, Chenbin Zhang, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, and Kun Gai. 2023. PEPNet: Parameter and Embedding Personalized Network for Infusing with Personalized Prior Information. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3795–3804
  5. [5] Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data. 1–4
  6. [6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al.
  7. [7] Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10
  8. [8] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. 2020. Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 3609–3616
  9. [9] Qin Ding, Kevin Course, Linjian Ma, Jianhui Sun, Rouchen Liu, Zhao Zhu, Chunxing Yin, Wei Li, Dai Li, Yu Shi, et al. 2026. Bending the Scaling Law Curve in Large-Scale Recommendation Systems. arXiv preprint arXiv:2602.16986 (2026)
  10. [10] Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI). 2301–2307
  11. [11] Huan Gui, Ruoxi Wang, Ke Yin, Long Jin, Maciej Kula, Taibai Xu, Lichan Hong, and Ed H. Chi. 2023. Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems. arXiv preprint arXiv:2311.05884 (2023)
  12. [12] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI). 1725–1731
  13. [13] Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2024. On the Embedding Collapse When Scaling Up Recommendation Models (ICML'24). JMLR.org, Article 671, 19 pages
  14. [14] Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 355–364
  15. [15] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk
  16. [16] Session-based Recommendations with Recurrent Neural Networks. In International Conference on Learning Representations (ICLR) Workshop
  17. [17] Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie, Chunzhi Yang, Jiyan Yang, and Huayu Li. …
  18. [18] Ruijie Hou, Zhaoyang Yang, Yu Ming, Hongyu Lu, Zhuobin Zheng, Yu Chen, Qinsong Zeng, and Ming Chen. 2024. Cross-Domain LifeLong Sequential Modeling for Online Click-Through Rate Prediction. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 5116–5125. doi:10.1145/3637528.3671601
  19. [19] Xian Hu, Ming Yue, Zhixiang Feng, Junwei Pan, Junjie Zhai, Ximei Wang, Xinrui Miao, Qian Li, Xun Liu, Shangyu Zhang, et al. 2025. Practice on Long Behavior Sequence Modeling in Tencent Advertising. arXiv:2510.21714 [cs.IR]
  20. [20] Tongwen Huang, Zhiqi Zhang, and Junlin Zhang. 2019. FiBiNET: Combining Feature Importance and Bilinear Feature Interaction for Click-Through Rate Prediction. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys). 169–177
  21. [21] Yunwen Huang, Shiyong Hong, Xijun Xiao, Jinqiu Jin, Xuanyuan Luo, Zhe Wang, Zheng Chai, Shikang Wu, Yuchao Zheng, and Jingjian Lin. 2026. HyFormer: Revisiting the Roles of Sequence Modeling and Feature Interaction in CTR Prediction. arXiv preprint arXiv:2601.12681 (2026)
  22. [22] Interactive Advertising Bureau and PricewaterhouseCoopers. 2025. Internet Advertising Revenue Report: Full Year 2024. Technical Report. Interactive Advertising Bureau (IAB) and PricewaterhouseCoopers (PwC). https://www.iab.com/wp-content/uploads/2025/04/IAB_PwC-Internet-Ad-Revenue-Report-Full-Year-2024.pdf Reports U.S. internet advertising revenue of …
  23. [23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR abs/2310.06…
  24. [24] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware Factorization Machines for CTR Prediction. In Proceedings of the 10th ACM Conference on Recommender Systems. 43–50
  25. [25] Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). 197–206
  26. [26] Simon Kemp. 2025. Digital 2025: The State of Social Media in 2025. DataReportal. https://datareportal.com/reports/digital-2025-sub-section-state-of-social Reports 5.24 billion active social media user identities worldwide in early 2025
  27. [27] Simon Kemp. 2025. Digital 2025: Top Social Platforms in 2025. DataReportal. https://datareportal.com/reports/digital-2025-sub-section-top-social-platforms Reports that TikTok's Android user base spent almost 35 hours using the app in November 2024
  28. [28] Simon Kemp. 2025. TikTok Users, Stats, Data & Trends for 2025. DataReportal. https://datareportal.com/essential-tiktok-stats Reports TikTok advertising reach of at least 1.59 billion users in January 2025
  29. [29] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37
  30. [30] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1754–1763
  31. [31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022
  32. [32] Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu. 2018. Field-weighted Factorization Machines for Click-Through Rate Prediction in Display Advertising. In Proceedings of The Web Conference (WWW). 1349–1357
  33. [33] Junwei Pan, Wei Xue, Ximei Wang, Haibin Yu, Xun Liu, Shijie Quan, Xueming Qiu, Dapeng Liu, Lei Xiao, and Jie Jiang. 2024. Ads Recommendation in a Collapsed and Entangled World. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3319–3330
  34. [34] Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692
  35. [35] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, et al. 2025. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free. arXiv preprint arXiv:2505.06708 (2025)
  36. [36] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang
  37. [37] Product-Based Neural Networks for User Response Prediction. In Proceedings of the 2016 IEEE International Conference on Data Mining. 1149–1154
  38. [38] Steffen Rendle. 2010. Factorization Machines. In 2010 IEEE International Conference on Data Mining (ICDM). 995–1000
  39. [39] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural Collaborative Filtering vs. Matrix Factorization Revisited. In Proceedings of the 14th ACM Conference on Recommender Systems. 240–248
  40. [40] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting Clicks: Estimating the Click-Through Rate for New Ads. In Proceedings of the 16th International Conference on World Wide Web. 521–530
  41. [41] Zihua Si, Lin Guan, ZhongXiang Sun, Xiaoxue Zang, Jing Lu, Yiqun Hui, Xingchao Cao, Zeyu Yang, Yichen Zheng, Dewei Leng, et al. 2024. TWIN-V2: Scaling Ultra-Long User Behavior Sequence Modeling for Enhanced CTR Prediction at Kuaishou. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. 4890–4897
  42. [42] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM)
  43. [43] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang
  44. [44] BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM). 1441–1450
  45. [45] Yang Sun, Junwei Pan, Alex Zhang, and Aaron Flores. 2021. FM²: Field-matrixed Factorization Machines for Recommender Systems. In Proceedings of the Web Conference (WWW). 2828–2837
  46. [46] Jiaxi Tang and Ke Wang. 2018. Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM)
  47. [47] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD'17. 1–7
  48. [48] Ruoxi Wang, Rakesh Shivanna, Derek Z. Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed H. Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. In Proceedings of the Web Conference (WWW). 1785–1797
  49. [49] Mingjia Yin, Junwei Pan, Hao Wang, Ximei Wang, Shangyu Zhang, Jie Jiang, Defu Lian, and Enhong Chen. 2025. From Feature Interaction to Feature Generation: A Generative Paradigm of CTR Prediction Models. arXiv preprint arXiv:2512.14041 (2025)
  50. [50] Zhichen Zeng, Xiaolong Liu, Mengyue Hang, Xiaoyi Liu, Qinghai Zhou, Chaofei Yang, Yiqun Liu, Yichen Ruan, Laming Chen, Yuxin Chen, et al. 2025. InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management
  51. [51] Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, Yinghai Lu, and Yu Shi. 2024. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. In Proceedings of the 41st International Conference on Machine Learning (ICML)
  52. [52] Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, and Wenlin Chen. 2024. Wukong: Towards a Scaling Law for Large-Scale Recommendation. arXiv:2403.02545 [cs.LG]
  53. [53] Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael Tsang, Wenjun Wang, et al. 2022. DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2060–2069
  54. [54] Junjie Zhang, Ruobing Xie, Hongyu Lu, Wenqi Sun, Wayne Xin Zhao, Yu Chen, and Zhanhui Kang. 2025. Frequency-Augmented Mixture-of-Heterogeneous-Experts Framework for Sequential Recommendation. In Proceedings of the ACM on Web Conference 2025. 2596–2607
  55. [55] Zhaoqi Zhang, Haolei Pei, Jun Guo, Tianyu Wang, Yufei Feng, Hui Sun, Shaowei Liu, and Aixin Sun. 2025. OneTrans: Unified Feature Interaction and Sequence Modeling with One Transformer in Industrial Recommender. arXiv:2510.26104 [cs.IR]
  56. [56] Zuowu Zheng, Xiaofeng Gao, Junwei Pan, Qi Luo, Guihai Chen, Dapeng Liu, and Jie Jiang. 2022. AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling. In 2022 IEEE International Conference on Data Mining (ICDM). 1257–1262
  57. [57] Guorui Zhou, Hengrui Hu, Hongtao Cheng, Huanjie Wang, Jiaxin Deng, Jinghao Zhang, Kuo Cai, Lejian Ren, Lu Ren, Liao Yu, et al. 2025. OneRec-V2 Technical Report. arXiv preprint arXiv:2508.20900 (2025)
  58. [58] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948
  59. [59] Guorui Zhou, Xiaoqiang Zhu, Chengru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1059–1068
  60. [60] Haolin Zhou, Junwei Pan, Xinyi Zhou, Xihua Chen, Jie Jiang, Xiaofeng Gao, and Guihai Chen. 2024. Temporal Interest Network for User Response Prediction. In Companion Proceedings of the ACM on Web Conference 2024. 413–422
  61. [61] Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, et al. 2025. RankMixer: Scaling Up Ranking Models in Industrial Recommenders. arXiv:2507.15551 [cs.IR]