Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale
Pith reviewed 2026-05-08 01:57 UTC · model grok-4.3
The pith
Versioned late materialization stores user histories once to support ultra-long sequences in recommendation model training without storage blowup.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that versioned late materialization eliminates data redundancy when training deep learning recommendation models on ultra-long sequences. Immutable user interaction histories are stored once in a normalized tier, and sequences are reconstructed on demand through lightweight versioned pointers. A bifurcated consistency protocol and projection pushdown support this design, while disaggregated preprocessing keeps the system efficient.
What carries the argument
Versioned late materialization, a paradigm that stores user interaction history once in an immutable normalized store and uses versioned pointers for just-in-time sequence reconstruction during training.
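A minimal sketch of the idea, not the paper's implementation: the names `HistoryStore`, `VersionedPointer`, and `materialize` are hypothetical, and the "version" is assumed here to be the count of events visible when the training example was logged. The abstract does not specify the real pointer format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VersionedPointer:
    """Hypothetical pointer kept in a training example instead of the full sequence."""
    user_id: str
    version: int  # assumed meaning: number of events visible at logging time


class HistoryStore:
    """Append-only store: each user's events are written once and never rewritten."""

    def __init__(self) -> None:
        self._events: Dict[str, List[str]] = {}

    def append(self, user_id: str, event: str) -> VersionedPointer:
        log = self._events.setdefault(user_id, [])
        log.append(event)
        return VersionedPointer(user_id, len(log))

    def materialize(self, ptr: VersionedPointer, max_len: int) -> Tuple[str, ...]:
        """Rebuild the sequence as of ptr.version, keeping the most recent max_len events."""
        visible = self._events.get(ptr.user_id, [])[: ptr.version]
        return tuple(visible[-max_len:]) if max_len > 0 else ()


store = HistoryStore()
pointers = [store.append("u1", e) for e in ["view:a", "click:b", "view:c", "buy:c"]]

early = pointers[1]                                    # logged after two events
short_seq = store.materialize(early, max_len=1)        # a short-sequence model
long_seq = store.materialize(pointers[3], max_len=3)   # a longer-sequence model
```

Because the pointer freezes the visible prefix, events appended after the example was logged never appear in its reconstructed sequence, and two models with different `max_len` values read the same stored history.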
If this is right
- Storage scales with unique user histories rather than with every training example's sequence.
- Multiple models with varying sequence length needs share one dataset without data duplication.
- Ultra-long sequences become feasible, leading to quality improvements in deployed models.
- Training remains compute-bound despite on-the-fly reconstruction thanks to I/O masking.
- The system provides foundational data infrastructure for sequence-heavy architectures such as HSTU.
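The storage-scaling point in the first bullet can be illustrated with back-of-envelope arithmetic. This is a sketch under made-up assumptions: every number below is illustrative rather than taken from the paper, storage is counted in "event-sized slots", and a pointer is assumed to cost one slot.

```python
def fat_row_slots(num_users: int, examples_per_user: int, seq_len: int) -> int:
    """Fat Row: every training example carries its own copy of the sequence."""
    return num_users * examples_per_user * seq_len


def late_materialized_slots(
    num_users: int, examples_per_user: int, events_per_user: int, pointer_cost: int = 1
) -> int:
    """Late materialization: history stored once per user, plus one pointer per example."""
    return num_users * (events_per_user + examples_per_user * pointer_cost)


# Illustrative numbers only (assumptions, not measurements from the paper).
fat = fat_row_slots(num_users=1_000, examples_per_user=50, seq_len=10_000)
late = late_materialized_slots(num_users=1_000, examples_per_user=50, events_per_user=10_000)
ratio = fat / late
```

Under these assumed numbers the Fat Row layout grows with examples times sequence length, while the normalized layout grows with unique history plus a small per-example term.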
Where Pith is reading between the lines
- This method could be adapted to other domains with shared long-context data, such as sequential recommendation in other fields or long-context language modeling.
- It opens the possibility for more granular control over sequence lengths per training batch without additional storage costs.
- The immutable tier might enable efficient versioning for auditing or rollback in data pipelines.
Load-bearing premise
The assumption that pipelined I/O prefetching and data-affinity optimizations in disaggregated preprocessing can fully hide the latency of reconstructing sequences at training time.
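To make the premise concrete, the following is a generic bounded-queue prefetcher, offered as a sketch of pipelined I/O prefetching in general rather than the paper's disaggregated preprocessing tier. The function `prefetching_loader`, the `depth` parameter, and the sleep-based stand-ins for storage I/O and GPU work are all assumptions made for illustration.

```python
import queue
import threading
import time
from typing import Callable, Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def prefetching_loader(
    batch_ids: List[int], reconstruct: Callable[[int], list], depth: int = 4
) -> Iterator[list]:
    """Yield reconstructed batches while a background thread fetches ahead."""
    buf: queue.Queue = queue.Queue(maxsize=depth)  # bounded: caps memory held in flight

    def worker() -> None:
        for b in batch_ids:
            buf.put(reconstruct(b))  # blocks when the buffer is full
        buf.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()  # blocks only if reconstruction has fallen behind
        if item is _DONE:
            return
        yield item


def slow_reconstruct(batch_id: int) -> list:
    time.sleep(0.01)  # stand-in for storage I/O and sequence reconstruction
    return [batch_id] * 3


seen = []
for batch in prefetching_loader(list(range(5)), slow_reconstruct):
    time.sleep(0.01)  # stand-in for a GPU training step
    seen.append(batch[0])
```

Masking holds only while reconstruction of upcoming batches finishes within the time the consumer spends on the current ones; once reconstruction is consistently slower than the training step, the `buf.get()` call starts to block and the pipeline becomes I/O-bound.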
What would settle it
Observing that training throughput becomes limited by data I/O or reconstruction overhead rather than GPU compute as sequence lengths increase in the production deployment would show the performance masking does not hold.
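One way such an observation could be made is to measure the fraction of wall-clock time the training loop spends waiting for data. The helper below is a hypothetical sketch, not a tool described in the paper; the `clock` parameter is injectable so the example runs deterministically.

```python
import time
from typing import Callable, Iterable


def data_stall_fraction(
    loader: Iterable,
    train_step: Callable[[object], None],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the share of loop time spent blocked on the data loader."""
    wait = 0.0
    compute = 0.0
    it = iter(loader)
    while True:
        t0 = clock()
        try:
            batch = next(it)  # time here is data wait
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)     # time here is compute
        t2 = clock()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0


# Deterministic demo with a fake clock: each batch waits 1 unit and computes for 3.
ticks = iter([0.0, 1.0, 4.0, 4.0, 5.0, 8.0, 8.0])
fraction = data_stall_fraction([1, 2], lambda b: None, clock=lambda: next(ticks))
```

A stall fraction that rises with sequence length would indicate that reconstruction latency is no longer hidden. On real accelerators, kernel launches are typically asynchronous, so the step would need to synchronize with the device before the clock is read for the compute timing to be meaningful.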
Original abstract
Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall where data infrastructure usage exceeds GPU training capacity due to data redundancy that is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
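The abstract does not define "multi-dimensional projection pushdown". One plausible reading, assumed here, is that each tenant asks the storage layer for only the feature columns and the sequence window it needs, bounded by a point in time so that no future events leak in. The sketch below illustrates that reading with a toy column-major layout; `read_projected` and all field names are hypothetical.

```python
from typing import Dict, List, Sequence

# Toy column-major history for one user: one list per feature, oldest event first.
history: Dict[str, List] = {
    "item_id":   [11, 12, 13, 14, 15],
    "action":    ["view", "click", "view", "buy", "view"],
    "dwell_ms":  [300, 1200, 450, 9000, 200],
    "timestamp": [100, 110, 120, 130, 140],
}


def read_projected(
    cols: Dict[str, List], features: Sequence[str], max_len: int, as_of: int
) -> Dict[str, List]:
    """Return only the requested features for the last max_len events with timestamp <= as_of."""
    end = sum(1 for t in cols["timestamp"] if t <= as_of)  # relies on time-ordered events
    start = max(0, end - max_len)
    return {name: cols[name][start:end] for name in features}


# A small model reads one column and two events; a larger one reads more of both.
small_tenant = read_projected(history, ["item_id"], max_len=2, as_of=130)
large_tenant = read_projected(history, ["item_id", "action", "dwell_ms"], max_len=4, as_of=140)
```

Pushing both the column selection and the length limit down to the read path means a short-sequence tenant never pays the I/O cost of the columns and events that only a long-sequence tenant uses.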
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a versioned late materialization paradigm to address the storage and I/O challenges in training deep learning recommendation models (DLRMs) with ultra-long user interaction sequences. By storing sequences in a normalized immutable tier and using lightweight versioned pointers for just-in-time reconstruction during training, it aims to eliminate redundancy in the 'Fat Row' approach, particularly in multi-tenant settings. The system incorporates a bifurcated protocol for online-to-offline consistency and disaggregated preprocessing with pipelined I/O to keep training GPU compute-bound. It claims production deployment success in reducing data infrastructure usage and enabling sequence scaling for quality improvements in models like HSTU and ULTRA-HSTU.
Significance. Should the I/O masking and consistency guarantees hold under production conditions, this architecture could be highly significant for the field of recommendation systems by allowing efficient scaling of sequence lengths without exploding storage and I/O costs. It provides a foundational data infrastructure that supports advanced model architectures, potentially leading to better model quality at scale.
major comments (2)
- [Abstract] The assertion that 'Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs' is load-bearing for the central claim of reduced infrastructure usage without throughput loss, yet no quantitative evidence (reconstruction latency distributions, prefetch hit rates, tail I/O stalls, or GPU utilization under heterogeneous tenant loads) is supplied to support it when sequences reach ultra-long regimes.
- [Abstract] The claims that the system 'reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains' and serves as 'foundational data infrastructure' for HSTU/ULTRA-HSTU rest on deployment assertions but supply no measurements, baselines, error bars, or ablation results, making the practical impact unverifiable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We have revised the manuscript to address the concerns by adding explicit references to the quantitative results and supporting sections in the body of the paper.
- Referee: [Abstract] The assertion that 'Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs' is load-bearing for the central claim of reduced infrastructure usage without throughput loss, yet no quantitative evidence (reconstruction latency distributions, prefetch hit rates, tail I/O stalls, or GPU utilization under heterogeneous tenant loads) is supplied to support it when sequences reach ultra-long regimes.
  Authors: We agree that the abstract would benefit from direct links to the supporting data. In the revised manuscript, we have updated the abstract to reference Section 5, which presents reconstruction latency distributions, prefetch hit rates, tail I/O stall analysis, and GPU utilization metrics under heterogeneous tenant loads for ultra-long sequences. These results substantiate that the I/O latency is masked and training remains compute-bound. revision: yes
- Referee: [Abstract] The claims that the system 'reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains' and serves as 'foundational data infrastructure' for HSTU/ULTRA-HSTU rest on deployment assertions but supply no measurements, baselines, error bars, or ablation results, making the practical impact unverifiable.
  Authors: The referee is correct that the abstract lacks specific metrics. We have revised the abstract to include high-level references to the observed infrastructure reductions and quality improvements, with explicit pointers to Sections 6 and 7. Those sections contain the full measurements, baselines, error bars, and ablation results from our production deployment that support the claims. revision: yes
Circularity Check
No circularity: systems architecture description with no derivations or self-referential fits
The paper describes a data infrastructure architecture (versioned late materialization, bifurcated O2O protocol, disaggregated preprocessing with pipelined prefetching) for DLRM training. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. Claims rest on engineering design choices and production deployment outcomes rather than any reduction of a result to its own inputs by construction. The central performance assertion (reconstruction latency masked to keep training GPU-bound) is an empirical systems claim, not a mathematical derivation that loops back on itself. This matches the default expectation for non-circular systems papers.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Bifurcated protocol prevents future leakage across streaming and batch training
- domain assumption: Read-optimized immutable storage layer supports multi-dimensional projection pushdown
invented entities (1)
- versioned pointers: no independent evidence