pith. machine review for the scientific record.

arXiv: 2604.08621 · v1 · submitted 2026-04-09 · cs.AI · cs.HC · cs.LG

Recognition: unknown

Sustained Impact of Agentic Personalisation in Marketing: A Longitudinal Case Study

Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3

classification: cs.AI · cs.HC · cs.LG
keywords: agentic personalization · marketing personalization · longitudinal case study · autonomous agents · human-in-the-loop · CRM · engagement metrics · personalized messaging

The pith

Autonomous agents sustain positive engagement lifts in marketing after initial human setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tracks a real-world consumer app that uses agentic systems to personalize marketing messages over an 11-month period. It splits the observation into an active phase of direct human curation of content, audiences, and strategies, followed by a passive phase where agents operate alone from a fixed library. Results show the largest engagement gains during active management, yet the agents maintain a positive lift without further human input. The work explores whether ongoing oversight is necessary to keep personalization benefits from fading.

Core claim

The longitudinal case study demonstrates that while active human management of content, audiences, and strategies generates the highest relative lift in engagement metrics, autonomous agents operating from a fixed library successfully sustain a positive lift throughout the subsequent passive period of the 11-month study, supporting a model where humans handle initial discovery and agents preserve gains scalably.

What carries the argument

Agentic infrastructure for autonomous personalization from a fixed library of components, measured by comparing engagement metrics in an active human-managed phase against a following passive autonomous phase.

If this is right

  • Human effort can shift toward initial strategy and discovery rather than continuous management.
  • Performance gains from personalization can be retained scalably without matching increases in human resources.
  • A symbiotic workflow allows humans to set direction and agents to maintain results in CRM applications.
  • Fixed component libraries can support ongoing personalization in environments without rapid external shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent-based maintenance model could extend to other sustained personalization tasks such as product recommendations or support messaging.
  • Measuring cost or time savings from reduced human oversight would clarify the practical value beyond engagement metrics.
  • Repeating the active-to-passive switch in markets with higher volatility would test how long the sustained lift holds under different conditions.

Load-bearing premise

The fixed library of components stays relevant and effective throughout the passive phase, without degradation or external changes requiring updates.

What would settle it

Engagement metrics during the passive phase dropping to or below pre-personalization baseline levels would show the agents failed to sustain a positive lift.
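
The settling criterion above can be made concrete with a small calculation. The following is a minimal sketch, not the paper's method: it assumes a binary engagement outcome (for example, whether a user opened a message), a treatment group and a control group of independent users, and an approximate interval from the delta method on the log rate ratio. The function names and all counts are hypothetical.

```python
import math


def relative_lift(treat_events, treat_n, ctrl_events, ctrl_n, z=1.96):
    """Return (lift, ci_low, ci_high) for treatment versus control.

    Lift is (p_treat - p_ctrl) / p_ctrl. The interval is an approximation
    built on log(p_treat / p_ctrl) for two independent binomial samples.
    All inputs are hypothetical counts, not figures from the paper.
    """
    p_t = treat_events / treat_n
    p_c = ctrl_events / ctrl_n
    ratio = p_t / p_c
    # Standard error of the log rate ratio.
    se_log = math.sqrt((1 - p_t) / treat_events + (1 - p_c) / ctrl_events)
    low = math.exp(math.log(ratio) - z * se_log)
    high = math.exp(math.log(ratio) + z * se_log)
    return ratio - 1, low - 1, high - 1


def lift_is_sustained(ci_low):
    """A passive-phase lift counts as sustained only if the whole interval is above zero."""
    return ci_low > 0
```

Under these assumptions, a passive-phase interval that includes zero or lies below it would count against the sustained-lift claim, while an interval strictly above zero would be consistent with it.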

Figures

Figures reproduced from arXiv: 2604.08621 by Eleanor Hanna, Olivier Jeunen, Schaun Wheeler.

Figure 1: A timeline describing our longitudinal setup. We analyse 11 months of randomised controlled trial data, where the first …
Figure 2: Visualising the feedback loop comprising end users, …
Figure 3: Relative lift across metrics and phases.
Original abstract

In consumer applications, Customer Relationship Management (CRM) has traditionally relied on the manual optimisation of static, rule-based messaging strategies. While adaptive and autonomous learning systems offer the promise of scalable personalisation, it remains unclear to what extent ``human-in-the-loop'' oversight is required to sustain performance uplift over time. This paper presents a longitudinal case study analysing a real-world consumer application that leverages agentic infrastructure to personalise marketing messaging for a large-scale user base over an 11-month period. We compare two distinct periods: an active phase where marketers directly curated content, audiences, and strategies -- followed immediately by a passive phase where agents operated autonomously from a fixed library of components. Our results demonstrate that whilst active human management generates the highest relative lift in engagement metrics, the autonomous agents successfully sustained a positive lift during the passive period. These findings suggest a symbiotic model where human intervention drives strategic initialisation and discovery, yet autonomous agents can ensure the scalable retention and preservation of performance gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a longitudinal case study of agentic personalization in a consumer CRM application. Within an 11-month observation period, it contrasts an active phase (human curation of content, audiences, and strategies) with an immediately following passive phase (autonomous agents operating on a fixed library of components), claiming that human management produces the highest engagement lift while the agents sustain a positive (though lower) lift without further intervention.

Significance. If the attribution to the agents can be isolated from external factors, the result would support a practical hybrid model in which humans handle initial discovery and agents handle scalable maintenance of personalization gains. The 11-month duration is a notable strength for a real-world deployment study.

major comments (3)
  1. [Abstract; Results] The central claim that autonomous agents 'successfully sustained a positive lift' during the passive phase is load-bearing for the paper's contribution, yet the abstract and results supply no quantitative metrics, confidence intervals, statistical tests, baseline comparisons, or sample sizes. Without these, the magnitude and reliability of the sustained lift cannot be assessed.
  2. [Methodology; Results] The case-study design (single cohort, active-to-passive transition) provides no concurrent control group, difference-in-differences estimator, or regression controls for time-varying confounders. Over 11 months, engagement metrics are plausibly affected by seasonality, platform algorithm changes, user fatigue, or market shifts; the manuscript therefore cannot isolate the agents' contribution from these factors.
  3. [Passive Phase Description] The assumption that the fixed library of components remains effective and relevant throughout the passive window is stated but not tested. No analysis of component usage frequency, performance degradation, or need for updates is reported, which directly affects the interpretation of 'sustained' performance.
minor comments (2)
  1. [Abstract] The abstract uses informal phrasing ('whilst', double backticks around 'human-in-the-loop') that should be standardized for journal style.
  2. [Data and Metrics] No mention of exclusion criteria, data cleaning steps, or how engagement metrics were defined and collected.
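
For readers unfamiliar with the estimator named in major comment 2, the following is a minimal sketch of a difference-in-differences calculation. It is an illustration only: the group labels, the per-period engagement rates, and the function name are hypothetical, and the estimate is meaningful only under the usual parallel-trends assumption (absent the intervention, both groups would have moved by the same amount).

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group.

    Each argument is a list of engagement rates for one group in one period.
    Subtracting the control change removes shocks shared by both groups,
    such as seasonality, under the parallel-trends assumption.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical rates: a seasonal dip lowers the control group by 0.02,
# while the treated group rises by 0.01 over the same window.
estimate = diff_in_diff(
    treated_pre=[0.10, 0.12],
    treated_post=[0.11, 0.13],
    control_pre=[0.10, 0.12],
    control_post=[0.08, 0.10],
)
```

In this made-up example a naive before-after comparison of the treated group alone would report a change of 0.01, whereas the difference-in-differences estimate credits the intervention with roughly 0.03 because the shared seasonal dip is netted out.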

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our longitudinal case study of agentic personalization in marketing. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging its limitations as a real-world deployment study. Revisions have been made where feasible to strengthen the presentation without overstating the evidence.

Point-by-point responses
  1. Referee: [Abstract; Results] The central claim that autonomous agents 'successfully sustained a positive lift' during the passive phase is load-bearing for the paper's contribution, yet the abstract and results supply no quantitative metrics, confidence intervals, statistical tests, baseline comparisons, or sample sizes. Without these, the magnitude and reliability of the sustained lift cannot be assessed.

    Authors: We agree that the abstract and results would benefit from more explicit quantitative reporting to allow readers to evaluate the claims. The full results section of the manuscript does contain comparative engagement metrics between the active and passive phases along with the overall user cohort size, but these were not summarized with sufficient precision or statistical detail in the abstract. We have revised the abstract to report the specific relative lift values observed (including the sustained positive lift in the passive phase), the sample size of the full user base, and baseline pre-active engagement levels. In the results section, we have added confidence intervals, p-values from appropriate statistical tests for the phase comparisons, and explicit baseline metrics.
    revision: yes

  2. Referee: [Methodology; Results] The case-study design (single cohort, active-to-passive transition) provides no concurrent control group, difference-in-differences estimator, or regression controls for time-varying confounders. Over 11 months, engagement metrics are plausibly affected by seasonality, platform algorithm changes, user fatigue, or market shifts; the manuscript therefore cannot isolate the agents' contribution from these factors.

    Authors: We recognize that the single-cohort before-after design cannot fully isolate the agents' causal contribution from external time-varying factors, as a concurrent control arm was not feasible in this live CRM deployment for business and ethical reasons. The immediate transition from active human oversight to passive agent operation provides a sharp natural breakpoint for longitudinal comparison within the same population. We have added an expanded Limitations subsection that explicitly discusses potential confounders (seasonality, platform changes, user fatigue, market shifts) and notes that the observed sustained lift occurred despite these pressures. We do not claim definitive causality and have tempered language throughout to describe observed associations and sustained performance rather than isolated agent effects.
    revision: partial

  3. Referee: [Passive Phase Description] The assumption that the fixed library of components remains effective and relevant throughout the passive window is stated but not tested. No analysis of component usage frequency, performance degradation, or need for updates is reported, which directly affects the interpretation of 'sustained' performance.

    Authors: We accept that the manuscript should have included direct evidence supporting the stability of the fixed component library. We have performed additional post-hoc analysis of the passive-phase data, quantifying usage frequency for each component across the passive phase and tracking performance trends for signs of degradation. The results show stable usage patterns with no statistically significant decline in component-level engagement metrics, which we now report in a new subsection of the results with an accompanying figure. This supports the interpretation of sustained performance without requiring updates during the passive window.
    revision: yes
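
The kind of component-level degradation check discussed in response 3 could be sketched as follows. This is a hypothetical illustration, not the analysis described in the simulated rebuttal: it assumes one engagement rate per component per month, fits an ordinary least-squares slope over the month index, and flags components whose slope falls below an arbitrary threshold. The component names, the threshold, and the function names are all assumptions, and a slope threshold is a screening heuristic rather than a significance test.

```python
def ols_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    return sxy / sxx


def flag_degrading(component_series, threshold=-0.001):
    """Return the sorted names of components whose monthly trend is below threshold.

    component_series maps a (hypothetical) component name to its monthly
    engagement rates during the passive phase.
    """
    return sorted(
        name
        for name, series in component_series.items()
        if ols_slope(series) < threshold
    )


# Hypothetical monthly engagement rates for two library components.
flagged = flag_degrading({
    "welcome_message": [0.10, 0.10, 0.10, 0.10],
    "promo_banner": [0.12, 0.11, 0.10, 0.09],
})
```

With these made-up numbers only the steadily declining component is flagged. A fuller analysis would attach uncertainty to each slope and correct for testing many components at once.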

Circularity Check

0 steps flagged

No circularity: purely empirical case study with no derivations or equations

full rationale

The paper is a longitudinal case study comparing engagement metrics across an active human-managed phase and a subsequent passive autonomous-agent phase over 11 months. No equations, parameters, derivations, or predictive models are present in the provided text or abstract. Claims rest on direct observation of lift in metrics rather than any chain that reduces to fitted inputs, self-definitions, or self-citations. The skeptic concern about unmeasured confounders is a validity issue, not circularity. The derivation chain is empty by construction, making this a self-contained empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No theoretical model, equations, or postulates; the work is an observational case study.

pith-pipeline@v0.9.0 · 5473 in / 986 out tokens · 69554 ms · 2026-05-10T17:38:35.407180+00:00 · methodology

