Memento: Personalized RAG-Style Long-Retention Data Scaling for META Ads Recommendation

Arnold Overwijk; Dinesh Ramasamy; Dorothy Sun; Jieming Di; Junwei Xiong; Lilly Kumari; Ling Leng; Nafis Abrar; Pan Chen; Qiao Yang

arxiv: 2605.24051 · v1 · pith:ZYXJPSHRnew · submitted 2026-05-22 · 💻 cs.IR

Xiaoyu Chen, Ruichen Wang, Jieming Di, Suofei Feng, Nafis Abrar, Lilly Kumari, Tony Tsui, Yilin Liu, Yu Lu, Sowmya Patapati, Junwei Xiong, Qiao Yang, Dorothy Sun, Yang Cao, Victor Chen, Pan Chen, Ramsundar Sundarkumar, Shivendra Pratap Singh, Arnold Overwijk, Ling Leng, Dinesh Ramasamy, Sri Reddy, Robert Malkin, Sandeep Pandey

Pith reviewed 2026-06-30 15:19 UTC · model grok-4.3

classification 💻 cs.IR

keywords personalized recommendation · retrieval-augmented generation · long-term user history · ads click-through rate · conversion prediction · Maximal Marginal Relevance · feature augmentation · multipass training

The pith

Memento retrieves relevant past user engagements via MMR to augment features and training data, scaling personalization to 365+ days at sub-10ms latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Memento as a retrieval-augmented system that treats a user's historical ad engagements as a document corpus and each new request as a query. It retrieves a balanced set of past interactions using Maximal Marginal Relevance to avoid both irrelevance and redundancy. These retrieved items feed two separate uses: one augments the current feature representation and the other supplies additional training examples for multipass learning. The design includes temporal chunking, quantization, and asynchronous serving to keep serving costs low. Production results show small gains in normalized entropy on click and conversion models plus measurable lifts in CTR and CVR while supporting a full year of history.

Core claim

Memento is a personalized RAG-style framework that models long user history by retrieving relevant past engagements with Maximal Marginal Relevance rather than scaling context linearly. Representation Memento pulls historical embeddings to enrich features for the current request. Data Memento pulls past training examples for repeated passes over selected data. Co-designed infrastructure of temporal chunking, INT8 quantization, and asynchronous serving yields 5-10x efficiency gains. The system processes requests at sub-10ms latency, delivers 0.25-0.3% Normalized Entropy improvement on both CTR and CVR prediction, and produces 1% CTR lift on Feed and Reels plus 1.2% CVR lift while scaling to 3

What carries the argument

Maximal Marginal Relevance (MMR) retrieval over a corpus of historical user engagements, applied for feature augmentation in Representation Memento and for example selection in Data Memento.
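
As a rough illustration of the retrieval step, the sketch below implements greedy MMR in plain Python. Everything in it is an assumption made for illustration: the names `cosine` and `mmr_select`, the choice of cosine similarity, and the trade-off weight `lam` are not taken from the paper, whose abstract does not specify a scoring function.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query: Sequence[float],
    history: List[Sequence[float]],
    k: int,
    lam: float = 0.5,  # hypothetical default; 1.0 means relevance only
) -> List[int]:
    """Greedily pick up to k indices from history, trading relevance for novelty."""
    relevance = [cosine(query, h) for h in history]
    selected: List[int] = []
    candidates = set(range(len(history)))

    def score(i: int) -> float:
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(history[i], history[j]) for j in selected), default=0.0
        )
        return lam * relevance[i] - (1.0 - lam) * redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to plain top-k similarity; lower values penalize candidates that sit close to items already chosen, which is the relevance-versus-redundancy balance described above.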

If this is right

Representation Memento adds retrieved historical embeddings to the current feature vector and improves prediction quality.
Data Memento selects past training examples for multipass training and yields additional gains without full-history processing.
Temporal chunking combined with INT8 quantization and asynchronous serving achieves 5-10x resource efficiency over linear scaling.
Daily request processing stays under 10ms latency while producing 0.25-0.3% Normalized Entropy gains on click-through and conversion tasks.
Production deployment on Facebook Feed and Reels shows 1% CTR lift and 1.2% CVR lift with support for 365+ days of history.
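
To make the quantization item concrete, here is a minimal sketch of symmetric per-vector INT8 quantization. It is an assumption about what such a step could look like, not the paper's scheme: the function names, the per-vector scale, and the [-127, 127] range are all hypothetical choices.

```python
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    # An all-zero (or empty) vector keeps scale 1.0 to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in vec]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats; error per element is about scale / 2 at most."""
    return [v * scale for v in quantized]
```

Stored this way, each dimension takes one byte instead of the four used by float32, plus one scale value per vector. How much of the reported 5-10x efficiency comes from this step is not stated in the abstract.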

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval pattern could be tested on other sequential prediction tasks where long user history is available but full-context attention is costly.
Selective replay of retrieved examples may offer an alternative route to mitigating catastrophic forgetting without explicit regularization terms.
The infrastructure co-design choices suggest the framework could be adapted to serving environments with tighter memory or latency budgets.
Comparing MMR against other diversity-aware retrievers in the same production setup would isolate how much the diversity component contributes to the observed lifts.

Load-bearing premise

That MMR retrieval of historical engagements supplies training signals and features that improve the downstream model without introducing selection bias or distribution shift.

What would settle it

An A/B experiment that disables the MMR retrieval step while holding all other model and serving components fixed, then measures whether the reported CTR and CVR lifts disappear.
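
For readers who want to see what "measuring whether the lift disappears" involves numerically, the sketch below computes a relative lift and a textbook two-proportion z-test. This is a generic illustration under simplifying assumptions (independent impressions, pooled variance); the function names are hypothetical and nothing here reflects Meta's actual A/B protocol.

```python
import math
from typing import Tuple


def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Relative change of the treatment rate over the control rate."""
    return (treatment_rate - control_rate) / control_rate


def two_proportion_z(
    clicks_c: int, n_c: int, clicks_t: int, n_t: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for a difference in click rates."""
    p_c = clicks_c / n_c
    p_t = clicks_t / n_t
    pooled = (clicks_c + clicks_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    if se == 0.0:
        # Degenerate case: no variation at all, so no evidence of a difference.
        return 0.0, 1.0
    z = (p_t - p_c) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

A control CTR of 2.00% against a treatment CTR of 2.02% corresponds to the 1% relative lift quoted above; whether such a difference is distinguishable from noise depends on the sample sizes, which the abstract does not report.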

Original abstract

Modeling of long history data suffers from long-context window attention dilution, system efficiency and catastrophic forgetting problems, where naive linear scaling approach like LastN would fail. We introduce Memento, a personalized retrieval-augmented framework that treats historical user engagements as a document corpus and ad requests as queries, retrieving relevant interactions via Maximal Marginal Relevance (MMR) to balance similarity with diversity. We identify two complementary applications: Representation Memento, which retrieves historical embeddings for feature augmentation, and Data Memento, which retrieves past training examples for multipass training. Through infrastructure co-design -- temporal chunking, INT8 quantization, and asynchronous serving -- Memento achieves 5-10× resource efficiency over linear scaling. Memento processes daily requests with sub-10ms latency, yielding 0.25-0.3% Normalized Entropy gain on both click-through and conversion prediction. In production, Memento delivers a 1% CTR lift on Facebook Feed and Reels and a 1.2% CVR lift, scaling personalization to 365+ days of history.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Memento pairs MMR retrieval with two uses in ads recsys for long history, plus infra tweaks for efficiency, but the abstract gives no experimental controls or bias checks.

The paper describes Memento, which treats user history as a document store and pulls relevant past engagements with Maximal Marginal Relevance. It applies the retrieval in two ways: one to augment current features with historical embeddings, and another to select past examples for additional training passes. The authors also outline temporal chunking, quantization, and async serving to cut resource use by 5-10x while keeping latency under 10ms.

The dual application and the concrete efficiency numbers are the clearest new pieces. Scaling to 365 days of history without blowing up compute is a real production constraint, and the reported 0.25-0.3% normalized entropy gain plus the 1% CTR and 1.2% CVR lifts on Feed and Reels are the kind of outcomes that matter at Meta scale.

The soft spot is exactly the one flagged in the stress test. The abstract invokes MMR for relevance-diversity balance but supplies no diagnostics on whether the selected subset shifts the training distribution or introduces selection bias. Without ablations, propensity weighting, or even basic baseline comparisons, it is impossible to tell whether the lifts come from better long-history use or from the retrieval changing what the model sees. The production A/B numbers are presented without enough context to evaluate them.

This is written for people who build large-scale recommendation systems and need practical ways to handle year-long user data. It could be useful to that group for the infrastructure ideas. For anyone outside that setting, the missing methodology makes it hard to assess.

If the full paper adds the controls and bias checks, it deserves a serious referee. From the abstract alone the central claims rest on an untested assumption about the retrieval step.

Referee Report

3 major / 1 minor

Summary. The paper introduces Memento, a personalized RAG-style framework for long-retention data scaling in META ads recommendation. It models historical user engagements as a document corpus and ad requests as queries, using Maximal Marginal Relevance (MMR) to retrieve relevant interactions for two applications: Representation Memento for feature augmentation and Data Memento for multipass training. The framework incorporates infrastructure co-design for efficiency and reports sub-10ms latency, 0.25-0.3% Normalized Entropy gains, and production lifts of 1% CTR and 1.2% CVR, scaling to 365+ days of history.

Significance. If the reported gains are attributable to the proposed method rather than confounding factors, Memento could offer a practical solution for scaling personalization in large-scale recommendation systems without the costs of linear history scaling. The infrastructure co-design elements like temporal chunking, INT8 quantization, and asynchronous serving represent a strength in achieving efficiency gains.

major comments (3)

[Abstract] The production A/B test results reporting 1% CTR lift on Facebook Feed and Reels and 1.2% CVR lift are presented without details on experimental controls, baseline comparisons, statistical significance, or how data exclusion/normalization choices affect the gains. This is load-bearing for the central claim of improvement.
[Representation Memento and Data Memento sections] The description invokes MMR for relevance-diversity balance but supplies no covariate-shift diagnostics, propensity reweighting, or ablation isolating retrieval-induced bias from the claimed benefit. If MMR over-samples high-engagement items, the gains could be artifacts of changed training distribution.
[Infrastructure co-design section] The claim of 5-10× resource efficiency over linear scaling is made but without specific quantitative comparisons or metrics showing the contribution of each technique (temporal chunking, INT8 quantization, asynchronous serving).
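
The over-sampling concern in the second comment can be checked with a very simple diagnostic: compare the positive-label rate of the retrieved subset with that of the full history. The sketch below is one hypothetical way to do this; the function names and the dictionary layout are assumptions, and a real covariate-shift analysis would look at more than a single rate.

```python
from typing import Dict, List


def positive_rate(labels: List[int]) -> float:
    """Fraction of examples with label 1; 0.0 for an empty list."""
    return sum(labels) / len(labels) if labels else 0.0


def selection_shift(full_labels: List[int], retrieved_idx: List[int]) -> Dict[str, float]:
    """Compare engagement rate in the retrieved subset against the full history."""
    full = positive_rate(full_labels)
    retrieved = positive_rate([full_labels[i] for i in retrieved_idx])
    # A ratio well above 1.0 suggests retrieval favors engaged examples.
    ratio = retrieved / full if full > 0 else float("inf")
    return {"full_rate": full, "retrieved_rate": retrieved, "ratio": ratio}
```

A ratio near 1.0 would be consistent with the retrieved set mirroring the original label distribution; a large ratio would support the referee's worry that gains partly reflect a changed training distribution.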

minor comments (1)

[Abstract] The term 'Normalized Entropy gain' is used without defining the baseline entropy or how it is normalized in this context.
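
For context, Normalized Entropy in the ads-prediction literature is commonly defined as average log loss divided by the entropy of the empirical positive rate. The sketch below follows that common definition; whether the paper uses exactly this form is an assumption, and the function name and `eps` clipping value are hypothetical.

```python
import math
from typing import List


def normalized_entropy(labels: List[int], preds: List[float], eps: float = 1e-12) -> float:
    """Average log loss divided by the entropy of the empirical positive rate."""
    n = len(labels)
    if n == 0 or n != len(preds):
        raise ValueError("labels and preds must be non-empty and the same length")
    log_likelihood = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        log_likelihood += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    avg_log_loss = -log_likelihood / n
    q = sum(labels) / n  # empirical positive rate
    if q <= 0.0 or q >= 1.0:
        raise ValueError("labels must contain both classes")
    background_entropy = -(q * math.log(q) + (1.0 - q) * math.log(1.0 - q))
    return avg_log_loss / background_entropy
```

Under this definition a model that always predicts the base rate scores 1.0 and lower is better, so a "0.25-0.3% gain" would usually be read as a relative reduction in this value. The abstract does not confirm that reading.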

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback highlighting areas where additional rigor would strengthen the central claims. We respond point-by-point to the major comments below and indicate where revisions will be made.

Referee: [Abstract] The production A/B test results reporting 1% CTR lift on Facebook Feed and Reels and 1.2% CVR lift are presented without details on experimental controls, baseline comparisons, statistical significance, or how data exclusion/normalization choices affect the gains. This is load-bearing for the central claim of improvement.

Authors: We agree that the abstract would benefit from additional context on the A/B test design. In the revision we will expand the abstract and add a short experimental-setup paragraph noting that the baseline is the production model using limited-history features (LastN), that tests ran for multiple weeks on Feed and Reels traffic, and that lifts are reported as relative improvements under standard Meta A/B protocols. Full internal controls and exact p-values remain subject to confidentiality constraints and cannot be disclosed.

revision: partial

Referee: [Representation Memento and Data Memento sections] The description invokes MMR for relevance-diversity balance but supplies no covariate-shift diagnostics, propensity reweighting, or ablation isolating retrieval-induced bias from the claimed benefit. If MMR over-samples high-engagement items, the gains could be artifacts of changed training distribution.

Authors: We will add an ablation in the revised manuscript that compares MMR retrieval against plain top-k similarity retrieval and reports engagement-rate histograms for the retrieved versus original data. These diagnostics show that MMR’s diversity term prevents over-sampling of high-engagement items. Because retrieval is performed per-user on the user’s own history, the overall training distribution remains representative; therefore full propensity reweighting is not required to isolate the benefit.

revision: yes

Referee: [Infrastructure co-design section] The claim of 5-10× resource efficiency over linear scaling is made but without specific quantitative comparisons or metrics showing the contribution of each technique (temporal chunking, INT8 quantization, asynchronous serving).

Authors: We will expand the infrastructure section with a table that quantifies the contribution of each technique relative to a linear-scaling baseline, using internal benchmark numbers (temporal chunking for indexing efficiency, INT8 quantization for memory reduction, and asynchronous serving for latency). This will make the 5-10× claim directly traceable to the individual components.

revision: yes

standing simulated objections not resolved

Complete disclosure of proprietary A/B-test controls, exact statistical tests, and internal normalization procedures remains withheld due to Meta confidentiality policies.

Circularity Check

0 steps flagged

No significant circularity; empirical production results are externally measured

The paper presents an engineering framework (MMR-based retrieval for long-history personalization) and reports measured outcomes from production A/B tests (1% CTR lift, 1.2% CVR lift, 0.25-0.3% NE gain) plus infrastructure metrics (sub-10ms latency). No mathematical derivation chain, equations, or fitted parameters are shown that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The core claims rest on external system measurements rather than tautological re-labeling of training signals or ansatzes. This is the normal case of a self-contained applied paper whose results can be falsified outside its own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; no explicit fitted values or new postulated constructs are described.
