Memento: Personalized RAG-Style Long-Retention Data Scaling for META Ads Recommendation
Pith reviewed 2026-06-30 15:19 UTC · model grok-4.3
The pith
Memento retrieves relevant past user engagements via MMR to augment features and training data, scaling personalization to 365+ days at sub-10ms latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Memento is a personalized RAG-style framework that models long user history by retrieving relevant past engagements with Maximal Marginal Relevance rather than scaling context linearly. Representation Memento pulls historical embeddings to enrich features for the current request. Data Memento pulls past training examples for repeated passes over selected data. Co-designed infrastructure of temporal chunking, INT8 quantization, and asynchronous serving yields 5-10x efficiency gains. The system processes requests at sub-10ms latency, delivers 0.25-0.3% Normalized Entropy improvement on both CTR and CVR prediction, and produces 1% CTR lift on Feed and Reels plus 1.2% CVR lift while scaling to 3
What carries the argument
Maximal Marginal Relevance (MMR) retrieval over a corpus of historical user engagements, applied for feature augmentation in Representation Memento and for example selection in Data Memento.
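As a rough illustration of the retrieval step, the sketch below implements greedy MMR selection over toy vectors. The function names, the use of cosine similarity, the `lam` trade-off weight, and the toy data are assumptions made for illustration; the paper's actual scoring function and parameters are not stated here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, docs, k, lam=0.5):
    """Greedily pick up to k document indices, trading relevance against redundancy.

    Each step picks the candidate maximizing
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    lam is a hypothetical trade-off weight; lam=1.0 reduces to plain top-k.
    """
    relevance = [cosine(query, d) for d in docs]
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "history": items 0 and 1 are near-duplicates, item 2 is different.
history = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
request = [1.0, 0.2]
picked = mmr_select(request, history, k=2, lam=0.5)
```

With `lam=0.5` the second pick skips the near-duplicate in favor of the dissimilar item, whereas `lam=1.0` behaves like top-k similarity retrieval and returns both near-duplicates.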
If this is right
- Representation Memento adds retrieved historical embeddings to the current feature vector and improves prediction quality.
- Data Memento selects past training examples for multipass training and yields additional gains without full-history processing.
- Temporal chunking combined with INT8 quantization and asynchronous serving achieves 5-10x resource efficiency over linear scaling.
- Daily request processing stays under 10ms latency while producing 0.25-0.3% Normalized Entropy gains on click-through and conversion tasks.
- Production deployment on Facebook Feed and Reels shows 1% CTR lift and 1.2% CVR lift with support for 365+ days of history.
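To make the INT8 quantization item concrete, here is a minimal sketch of symmetric per-vector quantization of an embedding. The per-vector scale, the clamping range, and the function names are assumptions; the review does not describe which quantization scheme the paper uses.

```python
def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical 4-dimensional embedding, for illustration only.
embedding = [0.52, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(embedding)
restored = dequantize_int8(codes, scale)
max_error = max(abs(r - x) for r, x in zip(restored, embedding))
```

Storing one byte per dimension instead of a 4-byte float32 cuts embedding memory by roughly 4x, at the cost of a rounding error of at most half the scale per dimension. That ratio is a general property of the encoding, not a figure reported by the paper.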
Where Pith is reading between the lines
- The same retrieval pattern could be tested on other sequential prediction tasks where long user history is available but full-context attention is costly.
- Selective replay of retrieved examples may offer an alternative route to mitigating catastrophic forgetting without explicit regularization terms.
- The infrastructure co-design choices suggest the framework could be adapted to serving environments with tighter memory or latency budgets.
- Comparing MMR against other diversity-aware retrievers in the same production setup would isolate how much the diversity component contributes to the observed lifts.
Load-bearing premise
That MMR retrieval of historical engagements supplies training signals and features that improve the downstream model without introducing selection bias or distribution shift.
What would settle it
An A/B experiment that disables the MMR retrieval step while holding all other model and serving components fixed, then measures whether the reported CTR and CVR lifts disappear.
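The readout of such an experiment could be computed along the lines of the sketch below, which reports relative lift and a two-sided two-proportion z-test. The counts are invented for illustration, and nothing here reflects the paper's actual test protocol, which is not disclosed.

```python
import math


def relative_lift(ctrl_clicks, ctrl_n, treat_clicks, treat_n):
    """Relative change in rate of the treatment arm over the control arm."""
    p_c = ctrl_clicks / ctrl_n
    p_t = treat_clicks / treat_n
    return (p_t - p_c) / p_c


def two_proportion_z(ctrl_clicks, ctrl_n, treat_clicks, treat_n):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p_c = ctrl_clicks / ctrl_n
    p_t = treat_clicks / treat_n
    pooled = (ctrl_clicks + treat_clicks) / (ctrl_n + treat_n)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / ctrl_n + 1.0 / treat_n))
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Made-up counts: 2.00% vs 2.02% CTR on one million impressions per arm.
lift = relative_lift(20_000, 1_000_000, 20_200, 1_000_000)
z, p_value = two_proportion_z(20_000, 1_000_000, 20_200, 1_000_000)
```

In this invented example a 1% relative lift on one million impressions per arm is not significant at the 5% level, which is why sample sizes and significance details matter when interpreting lifts of this size.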
Original abstract
Modeling of long history data suffers from long-context window attention dilution, system efficiency and catastrophic forgetting problems, where naive linear scaling approach like LastN would fail. We introduce Memento, a personalized retrieval-augmented framework that treats historical user engagements as a document corpus and ad requests as queries, retrieving relevant interactions via Maximal Marginal Relevance (MMR) to balance similarity with diversity. We identify two complementary applications: Representation Memento, which retrieves historical embeddings for feature augmentation, and Data Memento, which retrieves past training examples for multipass training. Through infrastructure co-design -- temporal chunking, INT8 quantization, and asynchronous serving -- Memento achieves 5-10× resource efficiency over linear scaling. Memento processes daily requests with sub-10ms latency, yielding 0.25-0.3% Normalized Entropy gain on both click-through and conversion prediction. In production, Memento delivers a 1% CTR lift on Facebook Feed and Reels and a 1.2% CVR lift, scaling personalization to 365+ days of history.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Memento, a personalized RAG-style framework for long-retention data scaling in META ads recommendation. It models historical user engagements as a document corpus and ad requests as queries, using Maximal Marginal Relevance (MMR) to retrieve relevant interactions for two applications: Representation Memento for feature augmentation and Data Memento for multipass training. The framework incorporates infrastructure co-design for efficiency and reports sub-10ms latency, 0.25-0.3% Normalized Entropy gains, and production lifts of 1% CTR and 1.2% CVR, scaling to 365+ days of history.
Significance. If the reported gains are attributable to the proposed method rather than confounding factors, Memento could offer a practical solution for scaling personalization in large-scale recommendation systems without the costs of linear history scaling. The infrastructure co-design elements like temporal chunking, INT8 quantization, and asynchronous serving represent a strength in achieving efficiency gains.
major comments (3)
- [Abstract] The production A/B test results reporting 1% CTR lift on Facebook Feed and Reels and 1.2% CVR lift are presented without details on experimental controls, baseline comparisons, statistical significance, or how data exclusion/normalization choices affect the gains. This is load-bearing for the central claim of improvement.
- [Representation Memento and Data Memento sections] The description invokes MMR for relevance-diversity balance but supplies no covariate-shift diagnostics, propensity reweighting, or ablation isolating retrieval-induced bias from the claimed benefit. If MMR over-samples high-engagement items, the gains could be artifacts of changed training distribution.
- [Infrastructure co-design section] The claim of 5-10× resource efficiency over linear scaling is made but without specific quantitative comparisons or metrics showing the contribution of each technique (temporal chunking, INT8 quantization, asynchronous serving).
minor comments (1)
- [Abstract] The term 'Normalized Entropy gain' is used without defining the baseline entropy or how it is normalized in this context.
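For orientation, a common definition of Normalized Entropy in ads ranking divides the average log loss by the entropy of the background positive rate. The sketch below follows that convention; whether the paper uses exactly this form is an assumption, and the labels and predictions are toy values.

```python
import math


def normalized_entropy(labels, preds, eps=1e-12):
    """Average log loss divided by the entropy of the empirical positive rate.

    Assumes the labels contain both classes, so the base rate lies strictly in (0, 1).
    Lower is better; predicting the base rate for every example gives 1.0.
    """
    n = len(labels)
    log_loss = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        log_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    log_loss /= n
    base = sum(labels) / n
    base_entropy = -(base * math.log(base) + (1.0 - base) * math.log(1.0 - base))
    return log_loss / base_entropy


labels = [1, 0, 0, 0]
ne_baseline = normalized_entropy(labels, [0.25, 0.25, 0.25, 0.25])
ne_better = normalized_entropy(labels, [0.7, 0.1, 0.2, 0.1])
```

Under this definition a model that always predicts the base rate scores 1.0, and a reported "NE gain" would most naturally be read as a relative reduction below a baseline model's NE, though the paper would need to confirm that reading.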
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting areas where additional rigor would strengthen the central claims. We respond point-by-point to the major comments below and indicate where revisions will be made.
- Referee: [Abstract] The production A/B test results reporting 1% CTR lift on Facebook Feed and Reels and 1.2% CVR lift are presented without details on experimental controls, baseline comparisons, statistical significance, or how data exclusion/normalization choices affect the gains. This is load-bearing for the central claim of improvement.
  Authors: We agree that the abstract would benefit from additional context on the A/B test design. In the revision we will expand the abstract and add a short experimental-setup paragraph noting that the baseline is the production model using limited-history features (LastN), that tests ran for multiple weeks on Feed and Reels traffic, and that lifts are reported as relative improvements under standard Meta A/B protocols. Full internal controls and exact p-values remain subject to confidentiality constraints and cannot be disclosed.
  Revision: partial
- Referee: [Representation Memento and Data Memento sections] The description invokes MMR for relevance-diversity balance but supplies no covariate-shift diagnostics, propensity reweighting, or ablation isolating retrieval-induced bias from the claimed benefit. If MMR over-samples high-engagement items, the gains could be artifacts of changed training distribution.
  Authors: We will add an ablation in the revised manuscript that compares MMR retrieval against plain top-k similarity retrieval and reports engagement-rate histograms for the retrieved versus original data. These diagnostics show that MMR’s diversity term prevents over-sampling of high-engagement items. Because retrieval is performed per-user on the user’s own history, the overall training distribution remains representative; therefore full propensity reweighting is not required to isolate the benefit.
  Revision: yes
- Referee: [Infrastructure co-design section] The claim of 5-10× resource efficiency over linear scaling is made but without specific quantitative comparisons or metrics showing the contribution of each technique (temporal chunking, INT8 quantization, asynchronous serving).
  Authors: We will expand the infrastructure section with a table that quantifies the contribution of each technique relative to a linear-scaling baseline, using internal benchmark numbers (temporal chunking for indexing efficiency, INT8 quantization for memory reduction, and asynchronous serving for latency). This will make the 5-10× claim directly traceable to the individual components.
  Revision: yes
- Complete disclosure of proprietary A/B-test controls, exact statistical tests, and internal normalization procedures due to Meta confidentiality policies.
Circularity Check
No significant circularity; empirical production results are externally measured
The paper presents an engineering framework (MMR-based retrieval for long-history personalization) and reports measured outcomes from production A/B tests (1% CTR lift, 1.2% CVR lift, 0.25-0.3% NE gain) plus infrastructure metrics (sub-10ms latency). No mathematical derivation chain, equations, or fitted parameters are shown that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The core claims rest on external system measurements rather than tautological re-labeling of training signals or ansatzes. This is the normal case of a self-contained applied paper whose results can be falsified outside its own fitted values.