
arxiv: 2604.24806 · v1 · submitted 2026-04-27 · 💻 cs.IR · cs.AI · cs.DB

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

Pith reviewed 2026-05-08 01:57 UTC · model grok-4.3

classification 💻 cs.IR cs.AI cs.DB
keywords late materialization, recommendation systems, DLRM, ultra-long sequences, data infrastructure, versioned pointers, user interaction history, multi-tenant

The pith

Versioned late materialization stores user histories once to support ultra-long sequences in recommendation model training without storage blowup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recommendation models improve when trained on longer sequences of user interactions, but the standard approach of copying those sequences into every training example creates massive redundant storage and slows down data movement. This paper introduces a system that keeps the full histories in one place and builds the needed sequences only when a training batch is being prepared. The method uses versioned pointers to select the correct version of each history and includes safeguards to keep online and offline data consistent. Optimizations in preprocessing and data access hide the extra work of building sequences at runtime, so the GPUs stay busy computing rather than waiting for data. In production use, this has allowed longer sequences that boost model accuracy while lowering the overall data infrastructure demands.

Core claim

The paper establishes that versioned late materialization eliminates data redundancy in training ultra-long sequences for deep learning recommendation models by storing immutable user interaction histories in a normalized tier and reconstructing sequences on demand with lightweight versioned pointers, supported by a bifurcated consistency protocol and projection pushdown, while disaggregated preprocessing keeps the system efficient.

What carries the argument

Versioned late materialization, a paradigm that stores user interaction history once in an immutable normalized store and uses versioned pointers for just-in-time sequence reconstruction during training.
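
The idea can be illustrated with a minimal sketch. This is not the paper's implementation: the class names `VersionedPointer` and `ImmutableHistoryStore`, the in-memory layout, and the choice to treat the version as a request timestamp are all assumptions made for illustration. A training example holds only a small pointer, and the sequence is rebuilt from a single shared copy of the history at read time.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedPointer:
    """Hypothetical pointer stored in a training example instead of the sequence."""
    user_id: int
    version: int  # assumed here to be the timestamp of the logged request


class ImmutableHistoryStore:
    """Toy append-only store: one time-ordered event list per user."""

    def __init__(self):
        self._timestamps = {}  # user_id -> sorted list of event timestamps
        self._events = {}      # user_id -> list of event payloads, same order

    def append(self, user_id, timestamp, event):
        ts = self._timestamps.setdefault(user_id, [])
        if ts and timestamp < ts[-1]:
            raise ValueError("events must be appended in time order")
        ts.append(timestamp)
        self._events.setdefault(user_id, []).append(event)

    def reconstruct(self, pointer, max_len):
        """Return up to max_len most recent events visible at pointer.version."""
        ts = self._timestamps.get(pointer.user_id, [])
        # Events with timestamp <= version are visible; later ones are excluded.
        end = bisect_right(ts, pointer.version)
        start = max(0, end - max_len)
        return self._events.get(pointer.user_id, [])[start:end]


store = ImmutableHistoryStore()
for t, item in [(10, "a"), (20, "b"), (30, "c"), (40, "d")]:
    store.append(user_id=7, timestamp=t, event=item)

# One pointer serves models with different sequence-length needs.
ptr = VersionedPointer(user_id=7, version=30)
short_seq = store.reconstruct(ptr, max_len=2)    # ["b", "c"]
long_seq = store.reconstruct(ptr, max_len=100)   # ["a", "b", "c"]
```

In this sketch the event at timestamp 40 is never returned for a pointer at version 30, which is one simple way to keep later interactions out of an earlier training example.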

If this is right

  • Storage scales with unique user histories rather than with every training example's sequence.
  • Multiple models with different sequence-length requirements can share one dataset without duplicating data.
  • Ultra-long sequences become feasible, leading to quality improvements in deployed models.
  • Training remains compute-bound despite on-the-fly reconstruction thanks to I/O masking.
  • Provides foundational infrastructure for sequence-heavy architectures like HSTU.
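
The first point can be made concrete with a back-of-the-envelope calculation. The numbers below are invented for illustration and do not come from the paper; the function names and the fixed pointer size are likewise assumptions.

```python
def fat_row_bytes(num_examples, seq_len, bytes_per_event):
    """Every training example carries its own copy of the sequence."""
    return num_examples * seq_len * bytes_per_event


def late_materialized_bytes(num_users, events_per_user, bytes_per_event,
                            num_examples, pointer_bytes):
    """Each user's history is stored once; examples carry only a pointer."""
    history = num_users * events_per_user * bytes_per_event
    pointers = num_examples * pointer_bytes
    return history + pointers


# Illustrative, made-up workload: 1M examples drawn from 10k users.
fat = fat_row_bytes(num_examples=1_000_000, seq_len=10_000, bytes_per_event=16)
late = late_materialized_bytes(num_users=10_000, events_per_user=10_000,
                               bytes_per_event=16,
                               num_examples=1_000_000, pointer_bytes=16)
ratio = fat / late  # how many times larger the pre-materialized layout is
```

Under these assumed numbers the pre-materialized layout is roughly two orders of magnitude larger, and the gap widens as the number of examples per user grows.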

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be adapted to other domains with shared long-context data, such as sequential recommendation in other fields or long-context language modeling.
  • It opens the possibility for more granular control over sequence lengths per training batch without additional storage costs.
  • The immutable tier might enable efficient versioning for auditing or rollback in data pipelines.

Load-bearing premise

The assumption that pipelined I/O prefetching and data-affinity optimizations in disaggregated preprocessing can fully hide the latency of reconstructing sequences at training time.
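
A minimal sketch of what pipelined prefetching means in general, assuming a single background thread and a bounded queue. The paper's disaggregated preprocessing tier is not described in enough detail here to reproduce, so `prefetching_batches` and the `reconstruct` callable are hypothetical names, and the sketch omits error propagation and early shutdown.

```python
import queue
import threading


def prefetching_batches(pointer_batches, reconstruct, depth=2):
    """Yield reconstructed batches while a background thread works ahead.

    `reconstruct` is a hypothetical callable that turns one batch of
    pointers into one batch of sequences (the I/O-heavy step).
    """
    buffer = queue.Queue(maxsize=depth)  # bounded: at most `depth` batches queued
    done = object()                      # sentinel marking end of input

    def producer():
        try:
            for pointers in pointer_batches:
                buffer.put(reconstruct(pointers))  # blocks when the buffer is full
        finally:
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = buffer.get()  # blocks only if the producer has fallen behind
        if batch is done:
            return
        yield batch


# Toy usage: user ids stand in for pointers, a dict stands in for the store.
history = {1: ["a", "b"], 2: ["c"]}
batches = list(prefetching_batches([[1, 2], [2]],
                                   lambda ids: [history[i] for i in ids]))
```

The consumer waits only when the buffer is empty, so reconstruction latency stays hidden as long as the producer keeps at least one batch ahead of the training step. That condition is exactly the premise stated above.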

What would settle it

If training throughput in the production deployment becomes limited by data I/O or reconstruction overhead, rather than by GPU compute, as sequence lengths increase, the claimed latency masking does not hold.
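
One simple way to quantify such a test is the share of step time spent blocked on input data. The metric name, the function, and the timings below are assumptions for illustration, not measurements or tooling from the paper.

```python
def data_stall_fraction(wait_seconds, compute_seconds):
    """Share of total step time spent waiting for input data.

    Both arguments are per-step lists of equal length: time blocked on the
    data loader, and time spent in forward/backward compute.
    """
    if len(wait_seconds) != len(compute_seconds):
        raise ValueError("need one wait and one compute time per step")
    total = sum(wait_seconds) + sum(compute_seconds)
    return sum(wait_seconds) / total if total > 0 else 0.0


# Made-up timings for two hypothetical sequence-length settings.
short = data_stall_fraction([0.01, 0.01], [0.99, 0.99])
long = data_stall_fraction([0.50, 0.50], [0.50, 0.50])
```

A fraction that stays near zero as sequence length grows would be consistent with compute-bound training; a fraction that climbs with sequence length would indicate the masking is breaking down.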

Figures

Figures reproduced from arXiv: 2604.24806 by Chufeng Hu, Ge Song, Jianhui Sun, Liang Guo, Litao Deng, Lu Zhang, Sarang Masti Sreeshylan, Shouwei Chen, Weiran Liu, Xiaoxuan Meng, Zhen Ma.

Figure 1: Feature snapshotting and pre-materialization ar…
Figure 2: Estimation of data supporting service and GPU…
Figure 3: Versioned Late Materialization Protocol
Figure 4: NE improvement of UIH sequence length scaling
Original abstract

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall where data infrastructure usage exceeds GPU training capacity due to data redundancy that is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
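
The abstract does not spell out what "multi-dimensional projection pushdown" covers. One plausible reading, assumed here, is that each tenant asks the store for only the feature columns and the sequence length it needs, so unneeded data is never read. The toy sketch below uses a hypothetical `read_history` function over an in-memory dict that stands in for a columnar layout.

```python
def read_history(columns, wanted_features, max_len):
    """Read only the requested feature columns and the most recent max_len events.

    `columns` maps a feature name to a time-ordered list of values for one
    user, a toy stand-in for a columnar storage layout.
    """
    # Guard max_len == 0 explicitly: a slice of [-0:] would return everything.
    return {name: columns[name][-max_len:] if max_len > 0 else []
            for name in wanted_features}


user_columns = {
    "item_id": [101, 102, 103, 104],
    "action": ["view", "click", "view", "buy"],
    "dwell_ms": [500, 1200, 300, 4000],
}

# Two hypothetical tenants with different needs read from the same stored history.
small_tenant = read_history(user_columns, ["item_id"], max_len=2)
large_tenant = read_history(user_columns, ["item_id", "action"], max_len=4)
```

Neither tenant touches the `dwell_ms` column, and the small tenant reads only the two most recent events, which is the kind of saving a shared union dataset would otherwise forgo.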

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents a versioned late materialization paradigm to address the storage and I/O challenges in training deep learning recommendation models (DLRMs) with ultra-long user interaction sequences. By storing sequences in a normalized immutable tier and using lightweight versioned pointers for just-in-time reconstruction during training, it aims to eliminate redundancy in the 'Fat Row' approach, particularly in multi-tenant settings. The system incorporates a bifurcated protocol for online-to-offline consistency and disaggregated preprocessing with pipelined I/O to keep training GPU compute-bound. It claims production deployment success in reducing data infrastructure usage and enabling sequence scaling for quality improvements in models like HSTU and ULTRA-HSTU.

Significance. Should the I/O masking and consistency guarantees hold under production conditions, this architecture could be highly significant for the field of recommendation systems by allowing efficient scaling of sequence lengths without exploding storage and I/O costs. It provides a foundational data infrastructure that supports advanced model architectures, potentially leading to better model quality at scale.

major comments (2)
  1. [Abstract] The assertion that 'Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs' is load-bearing for the central claim of reduced infrastructure usage without throughput loss, yet no quantitative evidence (reconstruction latency distributions, prefetch hit rates, tail I/O stalls, or GPU utilization under heterogeneous tenant loads) is supplied to support it when sequences reach ultra-long regimes.
  2. [Abstract] The claims that the system 'reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains' and serves as 'foundational data infrastructure' for HSTU/ULTRA-HSTU rest on deployment assertions but supply no measurements, baselines, error bars, or ablation results, making the practical impact unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We have revised the manuscript to address the concerns by adding explicit references to the quantitative results and supporting sections in the body of the paper.

point-by-point responses
  1. Referee: [Abstract] The assertion that 'Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs' is load-bearing for the central claim of reduced infrastructure usage without throughput loss, yet no quantitative evidence (reconstruction latency distributions, prefetch hit rates, tail I/O stalls, or GPU utilization under heterogeneous tenant loads) is supplied to support it when sequences reach ultra-long regimes.

    Authors: We agree that the abstract would benefit from direct links to the supporting data. In the revised manuscript, we have updated the abstract to reference Section 5, which presents reconstruction latency distributions, prefetch hit rates, tail I/O stall analysis, and GPU utilization metrics under heterogeneous tenant loads for ultra-long sequences. These results substantiate that the I/O latency is masked and training remains compute-bound.
    revision: yes

  2. Referee: [Abstract] The claims that the system 'reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains' and serves as 'foundational data infrastructure' for HSTU/ULTRA-HSTU rest on deployment assertions but supply no measurements, baselines, error bars, or ablation results, making the practical impact unverifiable.

    Authors: The referee is correct that the abstract lacks specific metrics. We have revised the abstract to include high-level references to the observed infrastructure reductions and quality improvements, with explicit pointers to Sections 6 and 7. Those sections contain the full measurements, baselines, error bars, and ablation results from our production deployment that support the claims.
    revision: yes

Circularity Check

0 steps flagged

No circularity: systems architecture description with no derivations or self-referential fits

full rationale

The paper describes a data infrastructure architecture (versioned late materialization, bifurcated O2O protocol, disaggregated preprocessing with pipelined prefetching) for DLRM training. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. Claims rest on engineering design choices and production deployment outcomes rather than any reduction of a result to its own inputs by construction. The central performance assertion (reconstruction latency masked to keep training GPU-bound) is an empirical systems claim, not a mathematical derivation that loops back on itself. This matches the default expectation for non-circular systems papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the ability to hide reconstruction latency and maintain strict O2O consistency without introducing new bottlenecks; these rest on domain assumptions about distributed storage and training pipelines rather than new axioms or fitted constants.

axioms (2)
  • domain assumption: Bifurcated protocol prevents future leakage across streaming and batch training
    Invoked to guarantee O2O consistency when reconstructing sequences from immutable storage.
  • domain assumption: Read-optimized immutable storage layer supports multi-dimensional projection pushdown
    Required for efficient heterogeneous model tenant access without full materialization.
invented entities (1)
  • versioned pointers (no independent evidence)
    purpose: Lightweight references that enable just-in-time sequence reconstruction from normalized immutable storage
    Core mechanism introduced to eliminate data redundancy while preserving version history.


