
arxiv: 2605.16264 · v1 · pith:USZ5GL33 · submitted 2026-03-22 · 💻 cs.HC · cs.CL

LLM-Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging

Pith reviewed 2026-05-21 10:29 UTC · model grok-4.3

classification 💻 cs.HC cs.CL
keywords LLM notification composition · push notifications · message quality · context-aware messaging · persuasive notifications · CTR improvement · user engagement

The pith

Across reviewed deployments, LLM-generated notifications are reported to lift click-through rates 8 to 14.5 percent over static templates, a gain the paper attributes to higher message quality across six dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the wording of push notifications has been treated as a secondary detail while notification systems concentrated on selecting recipients, delivery times, and recommended content. It defines message quality through six concrete dimensions and argues that LLM composition advances each one relative to fixed templates or slot-filling methods. Reported gains appear in deployments across social media, food delivery, and e-commerce, though the author stresses that these lifts must be separated from changes in targeting or ranking logic. A reader would care because treating message generation as its own module could raise engagement without rebuilding the rest of the notification pipeline.

Core claim

Notification message quality is an independent, underinvested lever for engagement that has received less attention than targeting and timing. LLM-based composition improves this quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness), with reported CTR improvements of +8% to +14.5% over static templates and +1% to +2.5% over mature slot-filling systems, drawn from heterogeneous deployments that are not directly comparable. An architectural attribution analysis disentangles message generation from targeting, ranking, and timing to address misattribution risks, and a three-criterion decision framework specifies when LLM generation is, and is not, the binding constraint.
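The CTR figures do not state whether they are relative or absolute lifts. Assuming they are relative (a common convention in deployment reports), this minimal Python sketch shows why the distinction matters: on a hypothetical 4.0% template baseline, a +12.5% relative lift is only half a percentage point. The CTR values are illustrative and are not taken from any reviewed deployment.

```python
def relative_lift(ctr_control: float, ctr_treatment: float) -> float:
    """Relative CTR lift of treatment over control, as a fraction (0.10 == +10%)."""
    if ctr_control <= 0:
        raise ValueError("control CTR must be positive")
    return (ctr_treatment - ctr_control) / ctr_control


def absolute_lift_pp(ctr_control: float, ctr_treatment: float) -> float:
    """Absolute CTR difference in percentage points."""
    return (ctr_treatment - ctr_control) * 100


# Hypothetical CTRs, not drawn from any reviewed deployment.
template_ctr = 0.040  # 4.0% click-through with static templates
llm_ctr = 0.045       # 4.5% click-through with LLM-composed messages

print(f"relative lift: {relative_lift(template_ctr, llm_ctr):+.1%}")
print(f"absolute lift: {absolute_lift_pp(template_ctr, llm_ctr):+.2f} pp")
```

With these inputs the script prints a relative lift of +12.5% and an absolute lift of +0.50 pp.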

What carries the argument

The six-dimension notification message quality framework, which evaluates contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness to isolate the contribution of LLM composition from other system components.
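One way to keep the six dimensions separate in practice is to record a score per dimension rather than a single quality number. This sketch is hypothetical: the [0, 1] scale, the equal default weights, and the example scores are assumptions for illustration, and nothing here says how the paper itself scores each dimension.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class MessageQuality:
    """Per-dimension scores in [0, 1] for the six dimensions named in the paper.

    The scale and how scores are obtained (raters, an LLM judge, heuristics)
    are assumptions of this sketch, not the paper's method.
    """

    contextual_relevance: float
    clarity: float
    actionability: float
    novelty_handling: float
    linguistic_freshness: float
    persuasive_appropriateness: float

    def __post_init__(self) -> None:
        # Reject out-of-range scores early so aggregates stay interpretable.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weighted_score(self, weights: Optional[Dict[str, float]] = None) -> float:
        """Weighted mean across dimensions; equal weights unless given (hypothetical)."""
        names = [f.name for f in fields(self)]
        weights = weights or {name: 1.0 for name in names}
        total = sum(weights[name] for name in names)
        return sum(weights[name] * getattr(self, name) for name in names) / total

    def per_dimension_delta(self, baseline: "MessageQuality") -> Dict[str, float]:
        """Score differences versus a baseline message, dimension by dimension."""
        return {f.name: getattr(self, f.name) - getattr(baseline, f.name) for f in fields(self)}


# Hypothetical scores for one template message and one LLM-composed message.
template = MessageQuality(0.5, 0.8, 0.6, 0.3, 0.2, 0.5)
llm = MessageQuality(0.8, 0.8, 0.7, 0.6, 0.7, 0.6)
print(round(template.weighted_score(), 3), round(llm.weighted_score(), 3))
print(llm.per_dimension_delta(template))
```

Reporting `per_dimension_delta` alongside the aggregate shows which dimensions actually moved, since an aggregate gain could come from a single dimension.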

If this is right

  • Treating message generation as a distinct layer allows platforms to capture CTR gains without altering user-selection or delivery-timing logic.
  • The three-criterion decision framework limits LLM use to cases where text quality is the actual bottleneck, avoiding unnecessary compute costs.
  • A unified architecture that adds budget-aware routing, grounded generation, and online learning can integrate LLM composition into existing notification stacks.
  • Domain applications in social media, food delivery, and e-commerce can adopt the same quality dimensions and attribution checks to validate gains.
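Budget-aware routing can be pictured as deciding, per notification, whether LLM composition is worth its compute cost. The greedy uplift-per-cost rule below is a swapped-in illustration, not the paper's routing policy or its three decision criteria, which are not listed here. The fields `expected_uplift` and `llm_cost` are hypothetical, and requests with no predicted text-driven gain fall back to templates.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NotificationRequest:
    request_id: str
    expected_uplift: float  # hypothetical: predicted CTR gain of LLM text over the template
    llm_cost: float         # hypothetical: compute cost of generating this message with an LLM


def route(requests: List[NotificationRequest], budget: float) -> Dict[str, str]:
    """Greedy budget-aware routing between LLM composition and templates.

    Requests are ranked by expected uplift per unit cost; the LLM handles them
    in that order while budget remains. Requests with no positive predicted
    gain, and any that do not fit in the remaining budget, use a template.
    """
    routes = {r.request_id: "template" for r in requests}
    eligible = [r for r in requests if r.expected_uplift > 0 and r.llm_cost > 0]
    eligible.sort(key=lambda r: r.expected_uplift / r.llm_cost, reverse=True)
    remaining = budget
    for r in eligible:
        if r.llm_cost <= remaining:
            routes[r.request_id] = "llm"
            remaining -= r.llm_cost
    return routes


requests = [
    NotificationRequest("a", expected_uplift=0.010, llm_cost=2.0),
    NotificationRequest("b", expected_uplift=0.004, llm_cost=1.0),
    NotificationRequest("c", expected_uplift=0.000, llm_cost=1.0),  # no predicted text gain
    NotificationRequest("d", expected_uplift=0.006, llm_cost=1.0),
]
print(route(requests, budget=3.0))
```

With a budget of 3.0, `d` and `a` go to the LLM and `b` and `c` stay on templates. Greedy selection by ratio is a standard knapsack heuristic and is not guaranteed optimal; its appeal is that it is cheap enough to run inside a send loop.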

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the six dimensions transfer to other channels, the same separation of generation from targeting could raise response rates for emails or in-app prompts.
  • Online learning inside the proposed architecture might eventually personalize message style per user rather than per notification type.
  • Widespread adoption would shift engineering effort from refining ranking models toward monitoring and refining generation quality metrics.
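Per-user style personalization through online learning could take the form of a multi-armed bandit over message styles. This Python sketch uses Beta-Bernoulli Thompson sampling, a standard bandit method chosen here for illustration; the style labels, the per-user keying, and the simulated click rates are hypothetical, and the paper's own online-learning component is not specified here.

```python
import random
from collections import defaultdict
from typing import Optional


class StyleBandit:
    """Beta-Bernoulli Thompson sampling over message styles, tracked per user.

    Style labels and per-user keying are illustrative assumptions.
    """

    def __init__(self, styles: list, seed: Optional[int] = None) -> None:
        self.styles = list(styles)
        self.rng = random.Random(seed)
        # (user, style) -> [clicks + 1, non-clicks + 1], i.e. a Beta(1, 1) prior.
        self.params = defaultdict(lambda: [1, 1])

    def choose(self, user: str) -> str:
        """Sample a plausible CTR for each style and send the best-looking one."""
        draws = {s: self.rng.betavariate(*self.params[(user, s)]) for s in self.styles}
        return max(draws, key=draws.get)

    def update(self, user: str, style: str, clicked: bool) -> None:
        """Record one observed outcome for the style that was sent."""
        counts = self.params[(user, style)]
        counts[0 if clicked else 1] += 1


# Hypothetical simulation: one user who responds best to "concise" wording.
bandit = StyleBandit(["concise", "friendly", "detailed"], seed=7)
sim = random.Random(11)
true_ctr = {"concise": 0.30, "friendly": 0.05, "detailed": 0.05}
picks = []
for _ in range(2000):
    style = bandit.choose("user-1")
    bandit.update("user-1", style, clicked=sim.random() < true_ctr[style])
    picks.append(style)
print("share of last 500 sends using 'concise':", picks[-500:].count("concise") / 500)
```

Keying the posterior by (user, style) is the simplest form of per-user learning; sparse per-user data would likely call for pooling across similar users, which this sketch omits.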

Load-bearing premise

The CTR gains seen in the reviewed deployments are driven mainly by improvements in the wording of the messages rather than by differences in which users are chosen or when the notifications are delivered.

What would settle it

A controlled A/B test inside one notification platform that keeps targeting, ranking, and timing identical while switching only between LLM-generated messages and static templates, then measuring whether the CTR lift remains in the 8-14.5 percent range.
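The controlled test described above reduces to comparing two click-through proportions. A minimal sketch, assuming sends are randomized independently between a template arm and an LLM arm with targeting, ranking, and timing held fixed, is a pooled two-proportion z-test; the click and send counts are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_ztest(clicks_a: int, sends_a: int, clicks_b: int, sends_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test for a CTR difference; returns (z, two-sided p).

    Arm A: static template. Arm B: LLM-composed text. Everything else
    (targeting, ranking, timing) is assumed identical across arms.
    """
    ctr_a = clicks_a / sends_a
    ctr_b = clicks_b / sends_b
    pooled = (clicks_a + clicks_b) / (sends_a + sends_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if std_err == 0:
        raise ValueError("no variance: pooled CTR is exactly 0 or 1")
    z = (ctr_b - ctr_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: 100,000 sends per arm.
z, p = two_proportion_ztest(4_000, 100_000, 4_500, 100_000)
relative_lift = (4_500 / 100_000) / (4_000 / 100_000) - 1
print(f"z = {z:.2f}, p = {p:.1e}, relative lift = {relative_lift:+.1%}")
```

A real experiment would more often randomize by user, so repeated sends to the same person are correlated; that calls for a cluster-aware variance estimate, which this simple test ignores.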

Figures

Figures reproduced from arXiv: 2605.16264 by Nilesh Agrawal.

Figure 1: Unified LLM-based notification pipeline. Solid arrows: generation flow. Dashed arrows: …
Original abstract

Push notifications remain among the most direct channels through which digital platforms engage users, yet existing approaches have invested heavily in who to notify, when to notify, and what to recommend, while leaving how to communicate as the least-optimized stage. This paper argues that message quality is an independent, underinvested lever, and that LLMs create their most differentiated value precisely at this layer. We make three contributions. First, we define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM-based composition improves each relative to templates. Across reviewed deployments, reported improvements range from +8% to +14.5% CTR over static templates and +1% to +2.5% over mature slot-filling systems, though these span heterogeneous systems and should not be treated as directly comparable. Second, we provide an architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing), arguing that observed gains are frequently misattributed to text generation alone. Third, we introduce a three-criterion decision framework specifying when LLM generation is and is not the binding constraint. We support these arguments through a PRISMA-guided survey (28 sources from 142 screened), examine domain-specific applications across social media, food delivery, and e-commerce, and propose a unified architectural framework with budget-aware routing, grounded generation, candidate ranking, diversity controls, and online learning.
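Two of the proposed architectural pieces, grounded generation and diversity controls, can be sketched as filters applied before ranking picks a winner. In this hypothetical Python sketch, grounding is approximated by requiring that given facts appear verbatim in the text, and diversity by a word-level Jaccard similarity cap against recently sent messages; both are simple stand-ins chosen for illustration, and the candidate scores stand in for an unspecified ranker.

```python
import re
from typing import List, Optional, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens, keeping apostrophes inside words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_message(
    candidates: List[Tuple[str, float]],
    required_facts: List[str],
    recent_messages: List[str],
    max_similarity: float = 0.6,
) -> Optional[str]:
    """Return the highest-scoring candidate that is grounded and sufficiently fresh.

    candidates: (text, ranker_score) pairs from a hypothetical generator and ranker.
    required_facts: strings that must appear verbatim (a crude grounding check).
    recent_messages: texts recently sent to this user (diversity control).
    """
    recent_tokens = [tokens(m) for m in recent_messages]
    for text, _score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if not all(fact.lower() in text.lower() for fact in required_facts):
            continue  # drop candidates that omit the grounded facts
        cand = tokens(text)
        if any(jaccard(cand, prev) > max_similarity for prev in recent_tokens):
            continue  # too close to something the user already saw
        return text
    return None  # caller falls back to a static template


candidates = [
    ("Your order from Luigi's is on its way", 0.90),
    ("Luigi's: your pasta arrives in 12 minutes", 0.80),
    ("Hungry? Order now!", 0.95),
]
recent = ["Your order from Luigi's is on its way!"]
print(select_message(candidates, required_facts=["Luigi's"], recent_messages=recent))
```

Here the top-scored candidate fails grounding, the second repeats a recent message, and the third is selected. Returning `None` lets the caller fall back to a static template, so a generation failure never blocks the send.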

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a PRISMA-guided survey of 28 sources, drawn from 142 screened, on LLM-based composition of push notifications. It defines message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness), claims LLM generation improves each relative to templates or slot-filling systems, and reports CTR lifts of +8% to +14.5% over static templates and +1% to +2.5% over mature slot-filling systems from the reviewed deployments. It supplies an architectural attribution analysis to separate message generation from targeting, ranking, and timing, and introduces a three-criterion decision framework plus a unified architecture with budget-aware routing, grounded generation, and online learning.

Significance. If the attribution of gains holds after controlling for confounds, the work usefully elevates message composition as an independent optimization lever in notification systems and supplies a practical decision framework that could inform deployment choices in HCI applications such as social media, food delivery, and e-commerce. The explicit caution about heterogeneous sources and the architectural disentanglement are constructive contributions that could guide more rigorous future evaluations.

major comments (2)
  1. [Abstract] The central claim that LLM-based composition improves the six quality dimensions and produces +8% to +14.5% CTR gains rests on reviewed deployments whose heterogeneity the abstract itself flags. Without new controlled experiments that hold targeting, ranking, and timing fixed while varying only the generation method, the attribution to message quality alone remains weakly supported, yet it carries the paper's primary argument.
  2. [Architectural attribution analysis] As summarized in the abstract and contributions, the analysis correctly identifies the risk that gains may be misattributed to text generation, but it does not supply quantitative bounds, proposed experimental designs, or a re-analysis of the 28 sources that would isolate the composition effect. This leaves the misattribution concern noted in the stress-test unresolved at the level required to sustain the reported CTR ranges as evidence for the six-dimension improvements.
minor comments (2)
  1. [Survey methodology] The PRISMA screening process (142 to 28 sources) would benefit from an explicit flow diagram or table listing inclusion/exclusion criteria and the final set of sources to improve reproducibility.
  2. [Definition of quality dimensions] Clarify potential overlap among the six quality dimensions, particularly between contextual relevance and persuasive appropriateness, with concrete examples from the domain applications.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify the evidentiary challenges in attributing CTR gains specifically to message composition amid heterogeneous deployments. As a survey and framework paper, we address these points by clarifying limitations, strengthening caveats, and proposing paths for future rigorous evaluation. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that LLM-based composition improves the six quality dimensions and produces +8% to +14.5% CTR gains rests on reviewed deployments whose heterogeneity the abstract itself flags. Without new controlled experiments that hold targeting, ranking, and timing fixed while varying only the generation method, the attribution to message quality alone remains weakly supported, yet it carries the paper's primary argument.

    Authors: We agree that the reported CTR ranges derive from heterogeneous sources and that isolating the effect of generation method would require controlled experiments holding targeting, ranking, and timing constant. The manuscript already includes explicit language in the abstract and contributions section cautioning against direct comparability. As this work is a PRISMA-guided survey synthesizing existing deployments rather than a primary empirical study, we do not conduct new experiments. In revision we will expand the discussion section with a dedicated subsection outlining concrete experimental designs (e.g., within-platform A/B tests that fix all other components) to isolate composition effects, and we will further foreground the current evidential limitations. These changes will make the paper's claims more precisely scoped while preserving its contributions in defining the six dimensions and the decision framework. revision: partial

  2. Referee: [Architectural attribution analysis] As summarized in the abstract and contributions, the analysis correctly identifies the risk that gains may be misattributed to text generation, but it does not supply quantitative bounds, proposed experimental designs, or a re-analysis of the 28 sources that would isolate the composition effect. This leaves the misattribution concern noted in the stress-test unresolved at the level required to sustain the reported CTR ranges as evidence for the six-dimension improvements.

    Authors: The attribution analysis is intended to surface misattribution risks and disentangle generation from adjacent system components; we view this disentanglement itself as a useful contribution. Quantitative re-analysis of the 28 sources to derive tighter bounds is not feasible, because the original publications generally lack the granular per-component data or experimental controls required for such isolation. We will, however, add proposed experimental designs and a brief discussion of possible sensitivity-analysis approaches in the revised manuscript. These additions will directly respond to the concern by providing actionable guidance for future work that could strengthen attribution. revision: partial
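The rebuttal promises experimental designs that isolate the composition effect. One standard option, offered here as an illustration rather than as the author's design, is a 2x2 factorial test that crosses old versus new targeting with template versus LLM text. The cell CTRs are hypothetical.

```python
from typing import Dict, Tuple


def factorial_effects(ctr: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Effects from a 2x2 design: targeting in {"old", "new"} x text in {"template", "llm"}.

    Main effects are averaged over the other factor; the interaction measures
    how much the text effect changes between targeting versions. All values
    are in absolute CTR units.
    """
    text_gain_old = ctr[("old", "llm")] - ctr[("old", "template")]
    text_gain_new = ctr[("new", "llm")] - ctr[("new", "template")]
    targeting_gain_template = ctr[("new", "template")] - ctr[("old", "template")]
    targeting_gain_llm = ctr[("new", "llm")] - ctr[("old", "llm")]
    return {
        "text": (text_gain_old + text_gain_new) / 2,
        "targeting": (targeting_gain_template + targeting_gain_llm) / 2,
        "interaction": text_gain_new - text_gain_old,
    }


# Hypothetical cell CTRs from one platform, each cell randomized independently.
cells = {
    ("old", "template"): 0.040,
    ("old", "llm"): 0.042,
    ("new", "template"): 0.044,
    ("new", "llm"): 0.050,
}
effects = factorial_effects(cells)
naive = cells[("new", "llm")] - cells[("old", "template")]
print({name: round(value, 4) for name, value in effects.items()}, "naive:", round(naive, 4))
```

With these numbers, comparing the old-targeting template cell against the new-targeting LLM cell shows a 1.0 point gain, but only 0.4 points of it is the main effect of text and 0.6 points is the main effect of targeting; the nonzero interaction shows the text gain itself depends on which targeting is in place. A launch that changed both at once would credit the full 1.0 point to text, which is the misattribution the referee describes.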

standing simulated objections not resolved
  • Re-analysis of the 28 sources to produce quantitative bounds isolating the composition effect, as the source papers do not contain the necessary component-level data or controls.

Circularity Check

0 steps flagged

No significant circularity detected; survey and framework are self-contained

full rationale

The paper is a PRISMA-guided survey of 28 external sources plus a proposed architectural framework and decision criteria. No equations, derivations, or predictions reduce by construction to fitted parameters or self-referential inputs. CTR figures (+8% to +14.5%) are explicitly attributed to reviewed heterogeneous deployments rather than internal fitting, and the attribution analysis flags misattribution risks without relying on self-citation chains or definitional loops. Claims rest on externally reported deployment results, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that LLM-generated text is measurably superior in the listed dimensions and that survey sources provide representative evidence of real-world gains; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption LLM-generated messages improve contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness relative to static templates.
    Invoked in the first contribution when stating improvements across the six dimensions.
  • domain assumption Observed CTR gains in reviewed deployments can be at least partly attributed to message composition rather than adjacent system components.
    Central to the architectural attribution analysis and the claim that gains are frequently misattributed.
