LLM-Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging
Pith reviewed 2026-05-21 10:29 UTC · model grok-4.3
The pith
LLM-generated notifications are reported to improve click-through rates by 8 to 14.5 percent over static templates by raising message quality across six dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Notification message quality is an independent, underinvested lever for engagement that has received less attention than targeting and timing. LLM-based composition improves this quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) relative to templates, with reported CTR improvements of +8% to +14.5% over static templates across reviewed deployments. An architectural attribution analysis disentangles message generation from targeting, ranking, and timing to address misattribution risks, and a three-criterion decision framework specifies when LLM generation is the binding constraint.
What carries the argument
The six-dimension notification message quality framework, which evaluates contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness to isolate the contribution of LLM composition from other system components.
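One way to make the six dimensions concrete is to treat them as a per-message score record. The sketch below is illustrative only: the 0 to 1 scale, the equal default weights, and the names `QualityScores` and `weighted_quality` are assumptions introduced here, not definitions taken from the paper.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityScores:
    """Hypothetical per-message scores, each assumed to lie in [0, 1]."""
    contextual_relevance: float
    clarity: float
    actionability: float
    novelty_handling: float
    linguistic_freshness: float
    persuasive_appropriateness: float


def weighted_quality(scores, weights=None):
    """Weighted mean of the six dimension scores (equal weights by default)."""
    names = [f.name for f in fields(scores)]
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(getattr(scores, name) * weights[name] for name in names)
    return weighted_sum / total_weight


# Made-up scores for a static template, for illustration only.
template = QualityScores(0.4, 0.8, 0.6, 0.2, 0.1, 0.5)
print(round(weighted_quality(template), 3))
```

Keeping the dimensions as separate fields, rather than collapsing them into one number at scoring time, would let an evaluation report which dimension moved when copy changes.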
If this is right
- Treating message generation as a distinct layer allows platforms to capture CTR gains without altering user-selection or delivery-timing logic.
- The three-criterion decision framework limits LLM use to cases where text quality is the actual bottleneck, avoiding unnecessary compute costs.
- A unified architecture that adds budget-aware routing, grounded generation, and online learning can integrate LLM composition into existing notification stacks.
- Domain applications in social media, food delivery, and e-commerce can adopt the same quality dimensions and attribution checks to validate gains.
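The budget-aware routing mentioned in the list above could look roughly like the following. This is a minimal sketch under stated assumptions: the `Budget` class, the `route` function, the uplift-per-cost threshold, and the cost units are all hypothetical, and the paper's actual routing rule is not described in this review.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Remaining LLM spend for the current window, in arbitrary cost units."""
    remaining: float


def route(expected_uplift, llm_cost, budget, min_uplift_per_cost=1.0):
    """Return "llm" or "template" for one notification.

    expected_uplift: assumed estimate of the extra value from LLM-written copy.
    llm_cost: assumed cost of one generation call, in the same units as the budget.
    """
    affordable = llm_cost <= budget.remaining
    worthwhile = llm_cost > 0 and expected_uplift / llm_cost >= min_uplift_per_cost
    if affordable and worthwhile:
        budget.remaining -= llm_cost  # spend only when the LLM path is taken
        return "llm"
    return "template"


budget = Budget(remaining=1.0)
decisions = [route(uplift, 0.4, budget) for uplift in (0.9, 0.1, 0.8, 0.7)]
print(decisions)
```

In this toy run the second notification falls back to the template because its expected uplift is too low, and the fourth falls back because the budget is exhausted.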
Where Pith is reading between the lines
- If the six dimensions transfer to other channels, the same separation of generation from targeting could raise response rates for emails or in-app prompts.
- Online learning inside the proposed architecture might eventually personalize message style per user rather than per notification type.
- Widespread adoption would shift engineering effort from refining ranking models toward monitoring and refining generation quality metrics.
Load-bearing premise
The CTR gains seen in the reviewed deployments are driven mainly by improvements in the wording of the messages rather than by differences in which users are chosen or when the notifications are delivered.
What would settle it
A controlled A/B test inside one notification platform that keeps targeting, ranking, and timing identical while switching only between LLM-generated messages and static templates, then measuring whether the CTR lift remains in the 8-14.5 percent range.
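The readout of such a test can be sketched as a relative CTR lift plus a pooled two-proportion z-test. The function name `ctr_lift` and the click and send counts below are made up for illustration; they are not data from the paper or from any reviewed deployment.

```python
import math


def ctr_lift(clicks_control, sends_control, clicks_treatment, sends_treatment):
    """Relative CTR lift of treatment over control, with z statistic and two-sided p-value."""
    p_c = clicks_control / sends_control
    p_t = clicks_treatment / sends_treatment
    relative_lift = (p_t - p_c) / p_c
    # Pooled two-proportion z-test for the null hypothesis of equal CTR in both arms.
    pooled = (clicks_control + clicks_treatment) / (sends_control + sends_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sends_control + 1 / sends_treatment))
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return relative_lift, z, p_value


# Hypothetical counts: control arm uses static templates, treatment arm uses LLM copy,
# with targeting, ranking, and timing assumed identical across arms.
lift, z, p = ctr_lift(5000, 100000, 5500, 100000)
print(f"relative lift = {lift:.1%}, z = {z:.2f}, p = {p:.2g}")
```

The test only supports attribution to wording if assignment to the two arms is randomized after targeting, ranking, and timing decisions have already been made.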
Original abstract
Push notifications remain among the most direct channels through which digital platforms engage users, yet existing approaches have invested heavily in who to notify, when to notify, and what to recommend, while leaving how to communicate as the least-optimized stage. This paper argues that message quality is an independent, underinvested lever, and that LLMs create their most differentiated value precisely at this layer. We make three contributions. First, we define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM-based composition improves each relative to templates. Across reviewed deployments, reported improvements range from +8% to +14.5% CTR over static templates and +1% to +2.5% over mature slot-filling systems, though these span heterogeneous systems and should not be treated as directly comparable. Second, we provide an architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing), arguing that observed gains are frequently misattributed to text generation alone. Third, we introduce a three-criterion decision framework specifying when LLM generation is and is not the binding constraint. We support these arguments through a PRISMA-guided survey (28 sources from 142 screened), examine domain-specific applications across social media, food delivery, and e-commerce, and propose a unified architectural framework with budget-aware routing, grounded generation, candidate ranking, diversity controls, and online learning.
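The candidate ranking and diversity controls named at the end of the abstract can be illustrated with a small selection rule that penalizes candidates resembling recently sent messages. This is a sketch, not the paper's method: word-set Jaccard similarity as the freshness proxy, the `penalty` weight, and the function names are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pick_candidate(candidates, recent_messages, penalty=0.5):
    """Pick the text whose model score, minus a repetition penalty, is highest.

    candidates: list of (text, model_score) pairs; scores are assumed comparable.
    recent_messages: texts already sent to this user.
    """
    def adjusted(item):
        text, score = item
        overlap = max((jaccard(text, old) for old in recent_messages), default=0.0)
        return score - penalty * overlap

    return max(candidates, key=adjusted)[0]


recent = ["Your order is on the way"]
candidates = [
    ("Your order is on the way", 0.90),
    ("Dinner arrives in ten minutes", 0.80),
]
print(pick_candidate(candidates, recent))
```

Here the higher-scored candidate loses because it repeats a recent message verbatim, which is the kind of trade-off a linguistic freshness control is meant to enforce.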
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a PRISMA-guided survey of 28 sources drawn from 142 screened to examine LLM-based composition of push notifications. It defines message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness), claims LLM generation improves each relative to templates or slot-filling systems, and reports CTR lifts of +8% to +14.5% over static templates and +1% to +2.5% over mature systems from the reviewed deployments. It supplies an architectural attribution analysis to separate message generation from targeting, ranking, and timing, and introduces a three-criterion decision framework plus a unified architecture with budget-aware routing, grounded generation, and online learning.
Significance. If the attribution of gains holds after controlling for confounds, the work usefully elevates message composition as an independent optimization lever in notification systems and supplies a practical decision framework that could inform deployment choices in HCI applications such as social media, food delivery, and e-commerce. The explicit caution about heterogeneous sources and the architectural disentanglement are constructive contributions that could guide more rigorous future evaluations.
major comments (2)
- [Abstract] Abstract: The central claim that LLM-based composition improves the six quality dimensions and produces +8% to +14.5% CTR gains rests on reviewed deployments whose heterogeneity is explicitly flagged in the abstract itself; without new controlled experiments that hold targeting, ranking, and timing fixed while varying only generation method, the attribution to message quality alone remains weakly supported and load-bearing for the paper's primary argument.
- [Architectural attribution analysis] Architectural attribution analysis (as summarized in the abstract and contributions): While the analysis correctly identifies the risk that gains may be misattributed to text generation, it does not supply quantitative bounds, proposed experimental designs, or re-analysis of the 28 sources that would isolate the composition effect; this leaves the misattribution concern noted in the stress-test unresolved at the level required to sustain the reported CTR ranges as evidence for the six-dimension improvements.
minor comments (2)
- [Survey methodology] The PRISMA screening process (142 to 28 sources) would benefit from an explicit flow diagram or table listing inclusion/exclusion criteria and the final set of sources to improve reproducibility.
- [Definition of quality dimensions] Clarify potential overlap among the six quality dimensions, particularly between contextual relevance and persuasive appropriateness, with concrete examples from the domain applications.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify the evidentiary challenges in attributing CTR gains specifically to message composition amid heterogeneous deployments. As a survey and framework paper, we address these points by clarifying limitations, strengthening caveats, and proposing paths for future rigorous evaluation. We respond to each major comment below.
- Referee: [Abstract] Abstract: The central claim that LLM-based composition improves the six quality dimensions and produces +8% to +14.5% CTR gains rests on reviewed deployments whose heterogeneity is explicitly flagged in the abstract itself; without new controlled experiments that hold targeting, ranking, and timing fixed while varying only generation method, the attribution to message quality alone remains weakly supported and load-bearing for the paper's primary argument.
  Authors: We agree that the reported CTR ranges derive from heterogeneous sources and that isolating the effect of generation method would require controlled experiments holding targeting, ranking, and timing constant. The manuscript already includes explicit language in the abstract and contributions section cautioning against direct comparability. As this work is a PRISMA-guided survey synthesizing existing deployments rather than a primary empirical study, we do not conduct new experiments. In revision we will expand the discussion section with a dedicated subsection outlining concrete experimental designs (e.g., within-platform A/B tests that fix all other components) to isolate composition effects, and we will further foreground the current evidential limitations. These changes will make the paper's claims more precisely scoped while preserving its contributions in defining the six dimensions and the decision framework.
  Revision: partial
- Referee: [Architectural attribution analysis] Architectural attribution analysis (as summarized in the abstract and contributions): While the analysis correctly identifies the risk that gains may be misattributed to text generation, it does not supply quantitative bounds, proposed experimental designs, or re-analysis of the 28 sources that would isolate the composition effect; this leaves the misattribution concern noted in the stress-test unresolved at the level required to sustain the reported CTR ranges as evidence for the six-dimension improvements.
  Authors: The attribution analysis is intended to surface misattribution risks and disentangle generation from adjacent system components; we view this disentanglement itself as a useful contribution. Quantitative re-analysis of the 28 sources to derive tighter bounds is not feasible, because the original publications generally lack the granular per-component data or experimental controls required for such isolation. We will, however, add proposed experimental designs and a brief discussion of possible sensitivity-analysis approaches in the revised manuscript. These additions will directly respond to the concern by providing actionable guidance for future work that could strengthen attribution.
  Revision: partial
- Re-analysis of the 28 sources to produce quantitative bounds isolating the composition effect, as the source papers do not contain the necessary component-level data or controls.
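Even without component-level data, a simple sensitivity analysis can show how much of a reported lift would remain for composition under different assumed confound sizes. The sketch below assumes effects combine multiplicatively, so that (1 + observed) = (1 + composition) × (1 + confound); that assumption, the function name `composition_lift`, and the confound values of 2% and 5% are hypothetical and not drawn from the paper. Only the +8% and +14.5% endpoints come from the reported range.

```python
def composition_lift(observed_lift, confound_lift):
    """Lift left for composition under a multiplicative-effects assumption."""
    if confound_lift <= -1.0:
        raise ValueError("confound_lift must be greater than -1")
    return (1.0 + observed_lift) / (1.0 + confound_lift) - 1.0


# Reported endpoints from the reviewed deployments, paired with assumed confound sizes.
for observed in (0.08, 0.145):
    for confound in (0.0, 0.02, 0.05):
        residual = composition_lift(observed, confound)
        print(f"observed {observed:+.1%}, confound {confound:+.1%} "
              f"-> composition {residual:+.1%}")
```

A table like this does not resolve the attribution question, but it makes explicit how quickly the composition-only estimate shrinks as the assumed contribution of targeting, ranking, or timing grows.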
Circularity Check
No significant circularity detected; survey and framework are self-contained
The paper is a PRISMA-guided survey of 28 external sources plus a proposed architectural framework and decision criteria. No equations, derivations, or predictions reduce by construction to fitted parameters or self-referential inputs. CTR figures (+8% to +14.5%) are explicitly attributed to reviewed heterogeneous deployments rather than internal fitting, and the attribution analysis flags misattribution risks without relying on self-citation chains or definitional loops. Claims rest on external benchmarks, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM-generated messages improve contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness relative to static templates.
- domain assumption: Observed CTR gains in reviewed deployments can be at least partly attributed to message composition rather than adjacent system components.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM-based composition improves each relative to templates.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.