Make It Long, Keep It Fast: End-to-End 10K Long User Behavior Sequence Modeling for Billion-Scale Douyin Recommendation
Pith reviewed 2026-05-21 19:05 UTC · model grok-4.3
The pith
Stacked target-to-history cross attention and request-level batching let recommendation systems model 10K-length user histories end-to-end at production scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The system replaces history self-attention with Stacked Target-to-History Cross Attention to achieve linear complexity in sequence length, introduces Request Level Batching to aggregate targets for the same user and share encoding costs, and applies length-extrapolative training so that models trained on shorter sequences can be deployed on 10K-length histories. These techniques together enable end-to-end training and inference at the 10K regime while preserving the original learning objective and production latency requirements.
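To make the quadratic-versus-linear claim concrete, here is a back-of-the-envelope count of attention scores per layer. It assumes a history of length $L$ and a single target, and ignores attention heads and hidden-width constants; the symbols are illustrative and not taken from the paper.

```latex
\begin{align*}
\text{history self-attention:} \quad & L \times L = L^{2} \text{ scores per layer}, \\
\text{target-to-history cross-attention:} \quad & 1 \times L = L \text{ scores per layer per target}, \\
L = 10^{4}: \quad & 10^{8} \text{ scores versus } 10^{4} \text{ scores}.
\end{align*}
```

Stacking $N$ cross-attention layers multiplies the second count by $N$, which keeps it linear in $L$.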
What carries the argument
Stacked Target-to-History Cross Attention (STCA), which stacks cross-attention layers that attend from the target item representation to the full user history sequence. Because only the target queries the history, cost grows linearly rather than quadratically with sequence length, while target-to-history interactions are still modeled.
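A minimal sketch of the idea in plain Python, not the paper's implementation. It assumes a single head, no learned projections, and a residual update between layers; the summary does not specify how the layers compose, so the names `cross_attention` and `stca` and the residual step are illustrative assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, history):
    """Attend from one query vector to L history vectors: O(L * d) work."""
    d = len(query)
    scores = [
        sum(q * h for q, h in zip(query, item)) / math.sqrt(d)
        for item in history
    ]
    weights = softmax(scores)
    # Weighted sum of history vectors, one output value per dimension.
    return [sum(w * item[j] for w, item in zip(weights, history)) for j in range(d)]


def stca(target, history, num_layers=2):
    """Stack cross-attention layers; each layer refines the target-side query."""
    query = list(target)
    for _ in range(num_layers):
        context = cross_attention(query, history)
        # Residual update (an assumption made for this sketch).
        query = [q + c for q, c in zip(query, context)]
    return query


random.seed(0)
d, L = 4, 1000
history = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
target = [random.gauss(0, 1) for _ in range(d)]
user_interest = stca(target, history)
```

Each layer touches every history item exactly once per target, so doubling the history length doubles the work instead of quadrupling it.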
If this is right
- Recommendation quality improves monotonically and predictably as history length scales from hundreds to 10K items.
- Gains mirror the scaling laws seen in large language models when both sequence length and model capacity are increased.
- End-to-end training over long sequences becomes feasible without quadratic compute or memory blow-up.
- Production systems can adopt the same architecture at full traffic while meeting strict latency targets.
Where Pith is reading between the lines
- The linear scaling may allow sequence lengths to grow further in future deployments as hardware improves.
- Similar cross-attention and batching patterns could transfer to other domains that rely on long user or session histories.
- Dynamic history truncation at inference time could be tuned per user without retraining.
Load-bearing premise
Training on shorter sequence windows produces a model that generalizes without loss to much longer histories at inference time.
What would settle it
Train exclusively on sequences shorter than 2K, then run inference on full 10K histories and measure offline and online metrics. If accuracy drops relative to a shorter-history baseline, or latency exceeds the production budget, the extrapolation claim fails.
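The offline half of such a check could be scripted roughly as follows. This is a sketch under stated assumptions: `toy_score` is a hypothetical stand-in for a trained model, the example data are made up, and AUC is used only as a representative ranking metric.

```python
def auc(labels, scores):
    """Pairwise AUC: probability a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_history_length(score_fn, examples, lengths):
    """Score each example with its history truncated to the most recent max_len items.

    examples: list of (history, target, label), history ordered oldest to newest.
    """
    labels = [label for _, _, label in examples]
    results = {}
    for max_len in lengths:
        scores = [score_fn(history[-max_len:], target) for history, target, _ in examples]
        results[max_len] = auc(labels, scores)
    return results


def toy_score(history, target):
    """Hypothetical scorer: fraction of history items equal to the target."""
    return sum(1 for item in history if item == target) / len(history)


examples = [
    ([3, 3, 3, 1, 2], 3, 1),
    ([4, 5, 6, 7, 8], 9, 0),
    ([1, 2, 3, 4, 5], 5, 0),
]
results = auc_by_history_length(toy_score, examples, lengths=[2, 5])
```

In a real run the lengths would span the training window and the full 10K history, and the comparison of interest is whether the metric at 10K stays at or above the shorter-history value.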
Original abstract
Short-video recommenders such as Douyin must exploit extremely long user behavior histories without breaking latency or cost budgets. We present an end-to-end industrial recommender system that scales long-sequence recommendation modeling to 10K-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training over long user behavior sequences. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy -- train on shorter windows, infer on much longer ones -- so the model generalizes to 10K-scale histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end ultra-long sequence recommendation to the 10K regime.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to scale end-to-end user behavior sequence modeling to 10K-length histories in a billion-scale short-video recommender (Douyin). It introduces Stacked Target-to-History Cross Attention (STCA) to replace quadratic self-attention with linear stacked cross-attention from target to history, Request Level Batching (RLB) to share user encodings across multiple targets per request, and a length-extrapolative training strategy that trains on shorter windows and infers on 10K histories. Offline and online experiments report monotonic gains with increased history length and model capacity; the system is deployed at full traffic with significant engagement metric improvements while satisfying production latency constraints.
Significance. If the empirical results and deployment claims hold under scrutiny, the work provides a concrete, production-validated path for linear-complexity long-sequence modeling in industrial recommenders. The combination of architectural simplification (STCA), batching efficiency (RLB), and training-time extrapolation offers a practical template that could influence scaling strategies beyond the current 1K–2K regime common in the field, especially where latency budgets preclude full self-attention.
major comments (2)
- [§3.3] §3.3 (length-extrapolative training): The claim that training on shorter windows generalizes to 10K inference without degradation or extra cost is central to the scaling narrative, yet the manuscript provides no ablation isolating attention score distributions, positional signal fidelity, or ranking calibration at 10K versus the training length. Without such diagnostics, it remains unclear whether observed gains stem from true extrapolation or from simply exposing more history at test time under a fixed training distribution.
- [§4.1–4.2] §4.1–4.2 (offline experiments): The reported monotonic gains and scaling-law behavior are presented without error bars, statistical significance tests, or ablation tables that isolate STCA, RLB, and the extrapolation strategy. This makes it difficult to assess robustness of the central claim that the system delivers predictable improvements at 10K scale.
minor comments (2)
- [§3.1] Notation for STCA stacking depth and cross-attention heads is introduced without an explicit equation or diagram showing how the stacked layers compose; a small illustrative figure would improve clarity.
- [§3.2] The description of Request Level Batching would benefit from a concrete example of storage/communication savings for a typical user with multiple targets per request.
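As an illustration of the kind of example the [§3.2] comment asks for, the arithmetic below compares per-target duplication of the history with one shared copy per request. Every number is hypothetical (10K history items, 64-dimensional embeddings, 2 bytes per value, 20 targets per request), and the baseline of one history copy per target is an assumption rather than a figure from the paper.

```python
def sequence_bytes(history_len, dim, bytes_per_value, copies):
    """Bytes needed to hold `copies` copies of one user's history embeddings."""
    return history_len * dim * bytes_per_value * copies


def rlb_savings(history_len, dim, bytes_per_value, targets_per_request):
    """Compare one history copy per target against one shared copy per request."""
    per_target = sequence_bytes(history_len, dim, bytes_per_value, targets_per_request)
    shared = sequence_bytes(history_len, dim, bytes_per_value, 1)
    return per_target, shared, per_target / shared


# Hypothetical configuration, chosen only to make the ratio visible.
per_target, shared, ratio = rlb_savings(
    history_len=10_000, dim=64, bytes_per_value=2, targets_per_request=20
)
```

Under these assumptions the sequence-related payload shrinks by a factor equal to the number of targets per request.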
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our approach and committing to revisions where the manuscript can be strengthened.
Point-by-point responses
-
Referee: [§3.3] §3.3 (length-extrapolative training): The claim that training on shorter windows generalizes to 10K inference without degradation or extra cost is central to the scaling narrative, yet the manuscript provides no ablation isolating attention score distributions, positional signal fidelity, or ranking calibration at 10K versus the training length. Without such diagnostics, it remains unclear whether observed gains stem from true extrapolation or from simply exposing more history at test time under a fixed training distribution.
Authors: We agree that explicit diagnostics on attention score distributions, positional signal fidelity, and ranking calibration at extended lengths would provide stronger support for the extrapolation claim. The original manuscript emphasizes the observed monotonic gains and production results but does not include these specific ablations. In the revised version we will add an analysis (in §3.3 or an appendix) comparing attention distributions and ranking metrics (e.g., NDCG and calibration error) for models trained on shorter windows and evaluated at 10K, to better separate the contribution of extrapolation from the benefit of additional history.
Revision: yes
-
Referee: [§4.1–4.2] §4.1–4.2 (offline experiments): The reported monotonic gains and scaling-law behavior are presented without error bars, statistical significance tests, or ablation tables that isolate STCA, RLB, and the extrapolation strategy. This makes it difficult to assess robustness of the central claim that the system delivers predictable improvements at 10K scale.
Authors: We acknowledge that the absence of error bars, formal significance tests, and component-wise ablation tables limits the ability to evaluate robustness. The current presentation focuses on the overall scaling trends and deployment outcomes. In the revised manuscript we will augment §4.1 and §4.2 with error bars on the scaling plots, report statistical significance for the key improvements, and include a consolidated ablation table that isolates the individual and joint contributions of STCA, RLB, and length-extrapolative training.
Revision: yes
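One common way to produce error bars of the kind discussed in the second response is a percentile bootstrap over paired metric differences. The sketch below is an illustration only, not the authors' procedure; the function name `bootstrap_ci` and the per-day deltas are hypothetical.

```python
import random


def bootstrap_ci(deltas, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired metric differences."""
    if not deltas:
        raise ValueError("need at least one paired difference")
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(num_resamples):
        # Resample the differences with replacement and record the mean.
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int(alpha / 2 * num_resamples)]
    upper = means[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lower, upper


# Hypothetical per-day AUC differences (long-history model minus baseline).
deltas = [0.004, 0.006, 0.005, 0.003, 0.007, 0.005, 0.004, 0.006]
lower, upper = bootstrap_ci(deltas)
```

An interval that excludes zero suggests the improvement is unlikely to be resampling noise, which is the kind of evidence the referee's request for significance tests is after.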
Circularity Check
No significant circularity in derivation or claims
The paper introduces STCA (stacked target-to-history cross-attention for linear complexity), RLB (request-level batching), and a length-extrapolative training strategy (train short, infer long). These are architectural and procedural proposals whose benefits are reported from offline experiments, online A/B tests, and production deployment on Douyin. No equations, self-citations, or fitted parameters are shown to reduce the claimed scaling gains or generalization to the inputs by construction. The length-extrapolation claim rests on observed monotonic improvements rather than a tautological redefinition or self-referential theorem.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Tag: unclear (relation between the paper passage and the cited Recognition theorem)
Paper passage: STCA replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Similar Users-Augmented Interest Network
SUIN improves CTR prediction by augmenting target user sequences with similar users' behaviors via embedding-based retrieval, user-specific position encoding, and user-aware target attention.
-
IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems
IAT compresses each historical interaction instance into a unified embedding token via temporal-order or user-order schemes, allowing standard sequence models to learn long-range preferences with better performance an...
-
One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.