LLM Retrieval for Stable and Predictable Ad Recommendations

Vinodh Kumar Sunkara, Satheeshkumar Karuppusamy, Hangjun Xu, Sai Deepika Regani, Kshitij Gupta, Gaby Nahum, Sneha Iyer, Jean-Baptiste Fiot, Yinglong Guo, Xiaowen Guo, Atul Jangra, Yucheng Liu, Jinghao Yan, Vijay Pappu, Benjamin Schulte, Deepak Chandra

arXiv: 2605.21969 · v1 · pith:RIEFKFJ5new · submitted 2026-05-21 · 💻 cs.IR · cs.AI

Pith reviewed 2026-05-22 04:42 UTC · model grok-4.3

keywords LLM retrieval · ad recommendations · stability and predictability · semantic candidate generation · graph-based expansion · hierarchical semantic attributes · ads recommender systems

The pith

Fine-tuned LLMs extract hierarchical semantic attributes from ad creatives to support graph-based expansion that improves stability and predictability in recommendations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an evaluation framework to quantify stability and predictability in ads recommender systems, where these properties measure robustness to minor changes in ad creatives and inputs. It presents a semantic candidate generation approach that uses fine-tuned LLMs to pull out hierarchical semantic attributes, forming representations that feed into graph-based expansion so retrieved candidates cover semantic variants of each ad. This setup aims to produce consistent and explainable delivery even when advertisers make small creative changes. The framework was implemented and tested in a large-scale industrial ads system, where offline and online A/B experiments showed gains in both the new stability metrics and standard performance measures such as recall. A reader would care because growing ad inventory from generative AI makes unpredictable or non-repeatable recommendations a practical problem for advertisers.

Core claim

The paper claims that extracting hierarchical semantic attributes from ad creatives via fine-tuned LLMs yields representations that serve as the foundation for graph-based expansion; the resulting candidates encapsulate semantic variants of an ad and thereby guarantee consistent and explainable delivery results for small creative variants supplied by advertisers.

What carries the argument

Hierarchical semantic attributes extracted by fine-tuned LLMs, which produce representations that enable graph-based expansion to retrieve candidates covering semantic variants of each ad.
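
To make the mechanism concrete, here is a minimal Python sketch of how hierarchical attributes could group creative variants. It is an illustration, not the paper's implementation: the `AdAttributes` class, the three levels (category, intent, variant, borrowed from the simulated rebuttal further down), the `bucket_by_prefix` helper, and all attribute values are assumptions, and the dictionary stands in for the output of a fine-tuned LLM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AdAttributes:
    """Hypothetical hierarchical attributes for one creative, coarse to fine."""
    category: str  # e.g. "footwear"
    intent: str    # e.g. "running"
    variant: str   # e.g. "red colorway"

    def path(self, depth: int) -> tuple:
        """Return the first `depth` levels of the hierarchy."""
        return (self.category, self.intent, self.variant)[:depth]


def bucket_by_prefix(ads: dict, depth: int = 2) -> dict:
    """Group ad ids by a shared attribute prefix of length `depth`."""
    buckets = defaultdict(set)
    for ad_id, attrs in ads.items():
        buckets[attrs.path(depth)].add(ad_id)
    return dict(buckets)


# Made-up values standing in for LLM-extracted attributes.
ads = {
    "ad_1": AdAttributes("footwear", "running", "red colorway"),
    "ad_1b": AdAttributes("footwear", "running", "blue colorway"),
    "ad_2": AdAttributes("footwear", "hiking", "waterproof"),
}
buckets = bucket_by_prefix(ads, depth=2)
print(sorted(buckets[("footwear", "running")]))  # ['ad_1', 'ad_1b']
```

In this toy setup the two colorway variants differ only at the finest level, so they share a prefix and land in the same bucket, which is the behavior the paper's claim depends on.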

If this is right

Prediction stability improves system robustness to minor or noisy perturbations in ad creatives.
Consistent results for small creative variants reduce advertiser issues such as repeatability problems and cold-start effects.
Gains appear in both the new stability and predictability metrics and in traditional measures like recall and NDCG.
The same semantic candidate generation approach applies to other large-scale recommendation and retrieval systems that face scaling and predictability challenges.
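
The first point can be made measurable in a simple way. The simulated rebuttal further down describes stability as overlap between retrieved candidate sets; the sketch below assumes Jaccard overlap of the top-k candidates for an ad and its perturbed variant. The choice of Jaccard, the function names, and the toy index are assumptions, since the abstract gives no formula.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability_at_k(retrieve, ad_pairs, k: int = 10) -> float:
    """Mean top-k candidate overlap across (original, perturbed) ad pairs.

    `retrieve` is any callable mapping an ad to a ranked list of candidates.
    """
    if not ad_pairs:
        raise ValueError("ad_pairs must not be empty")
    scores = []
    for original, perturbed in ad_pairs:
        top_original = set(retrieve(original)[:k])
        top_perturbed = set(retrieve(perturbed)[:k])
        scores.append(jaccard(top_original, top_perturbed))
    return sum(scores) / len(scores)


# Toy retrieval index (hypothetical): two near-identical creatives.
toy_index = {
    "red running shoe": ["c1", "c2", "c3"],
    "red running shoe!": ["c1", "c2", "c4"],
}
score = stability_at_k(toy_index.get, [("red running shoe", "red running shoe!")], k=3)
print(score)  # 0.5: two shared candidates out of four distinct ones
```

A score of 1.0 would mean the perturbation left the candidate set unchanged; lower values indicate that a small creative edit reshuffled retrieval.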

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could extend to content or product recommendation domains where input variations similarly affect result consistency.
Graph expansion built on LLM attributes might lower the frequency of full system retraining when ad inventories change incrementally.
Explicit semantic attributes could support more transparent explanations of why particular ads are shown to users.

Load-bearing premise

Extracting hierarchical semantic attributes via fine-tuned LLMs and using them for graph-based expansion will reliably encapsulate semantic variants of an ad to guarantee consistent and explainable delivery results for small creative variants from advertisers.

What would settle it

An online A/B experiment in which small creative variants from advertisers produce no measurable reduction in predictability metrics or in inconsistent delivery rates when the LLM-based graph expansion is used versus a baseline retrieval method.
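
One way such an experiment could be read out, sketched under assumptions: treat each creative-variant pair as either consistent or inconsistent in delivery, then compare the inconsistent rate between the baseline arm and the LLM-expansion arm with a two-proportion z-test. The test choice, the function name, and the counts are illustrative; the paper does not state which statistical procedure it used.

```python
import math


def two_proportion_z(x_control: int, n_control: int, x_treat: int, n_treat: int):
    """Two-sided two-proportion z-test using the pooled standard error.

    x_* are counts of inconsistent-delivery pairs, n_* are pairs per arm.
    Returns (z, p_value); z < 0 means the treatment rate is lower.
    """
    p_control = x_control / n_control
    p_treat = x_treat / n_treat
    pooled = (x_control + x_treat) / (n_control + n_treat)
    variance = pooled * (1 - pooled) * (1 / n_control + 1 / n_treat)
    if variance == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_treat - p_control) / math.sqrt(variance)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 300 of 1000 inconsistent pairs in control, 240 of 1000 in treatment.
z, p_value = two_proportion_z(300, 1000, 240, 1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

Under this reading, the claim would be in trouble if the treatment arm showed no statistically detectable drop in the inconsistent-delivery rate.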

Figures

Figures reproduced from arXiv: 2605.21969 by Atul Jangra, Benjamin Schulte, Deepak Chandra, Gaby Nahum, Hangjun Xu, Jean-Baptiste Fiot, Jinghao Yan, Kshitij Gupta, Sai Deepika Regani, Satheeshkumar Karuppusamy, Sneha Iyer, Vijay Pappu, Vinodh Kumar Sunkara, Xiaowen Guo, Yinglong Guo, Yucheng Liu.

**Figure 2.** LLM Contextual Category Graph.

4.4 Real-Time Candidate Retrieval and Service. The real-time candidate retrieval and service layer offers high-throughput retrieval and compatibility with downstream ranking modules. The real-time candidate retrieval service framework we built serves as the foundational infrastructure for candidate generation, with optimized LLM inference to maximize utilization of the GPU hos…

**Figure 1.** … key components include LLM ad metadata generation, Ad to Ad Relevance Scoring, and horizontal scaling across high-performance GPUs. This setup facilitates efficient batch inference and boosts throughput, enabling the system to handle large volumes of data effectively.

**Figure 3.** Daily impression relative difference between the …

Original abstract

Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG). With the hyper-growth of ads inventory and liquidity with generative AI technologies, the prediction stability and predictability is becoming increasingly critical. Intuitively, prediction stability and predictability can be defined to quantify system robustness with respect to minor/noisy input (ads, creatives) perturbations, the lack of which could lead to advertiser perceivable problems such as repeatability, cold start and under-exploration. In this paper, we introduce a new evaluation framework for quantifying stability and predictability of an ads recommender system, and present an online validated semantic candidate generation framework powered by fine-tuned Large Language Models (LLMs) that showed significant improvement along these metrics by fundamentally improving the semantic-awareness of the system. The approach extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion, ensuring the retrieved candidates encapsulate semantic variants of an ad, guaranteeing that small creative variants from the advertiser yield consistent and explainable delivery results to the user. We tested this LLM ads retrieval framework in a large-scale industrial ads recommendation system, demonstrating significant improvements across offline and online A/B experiments, showcasing gains in both predictability and traditional performance metrics. Although evaluated in the ads stack, this is a general framework that can be applied broadly to any large-scale recommendation and retrieval systems facing similar scaling and predictability challenges.
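
For readers less familiar with the canonical metrics the abstract names, the following is a minimal sketch of recall@k and NDCG@k with binary relevance. The function names and the toy ranking are illustrative, and the sketch assumes each item appears at most once in the ranked list; the paper does not say which variant of NDCG it reports.

```python
import math


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: list, relevant: set, k: int) -> float:
    """NDCG@k with binary gains and a log2(position + 1) discount."""
    # enumerate() is 0-based, so position = rank + 1 and the discount is log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(recall_at_k(ranked, relevant, k=3))  # 2 of the 3 relevant items are retrieved
print(ndcg_at_k(ranked, relevant, k=3))
```

Both metrics score a single ranking against ground truth. Neither says anything about how the ranking changes when the input ad is slightly perturbed, which is the gap the paper's stability framing targets.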

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows LLM hierarchical attributes plus graph expansion can lift stability in large-scale ad retrieval with online A/B gains, but the stability metric and ablations stay too vague to pin down the source of the improvement.

The main point is that they fine-tune LLMs to pull hierarchical semantic attributes from ad creatives, feed those into a graph for expansion, and report better stability and predictability on top of usual metrics, all validated in live A/B tests at industrial scale. That combination for ads retrieval is the concrete new piece, aimed at the repeatability and cold-start problems that come with generative AI creatives flooding the inventory. The online tests give the claims some external grounding that pure offline work often lacks. They also put forward their own framework for measuring stability against small input changes, which is a useful direction even if it needs tightening. The industrial deployment itself is worth noting because it shows the method runs at the volume these systems actually face.

On the soft side, the stability definition stays intuitive rather than formalized, with no clear account of how it is measured independently of the LLM-graph pipeline or what data rules were applied. Graph construction details such as edge types, expansion depth, and attribute hierarchy building are not laid out enough for replication. Without ablations that separate the LLM step from standard embeddings, it is hard to know whether the reported gains trace to the proposed approach or simply to stronger representations. The new evaluation framework carries some risk of being shaped around the method, though the live tests reduce that concern.

This is aimed at practitioners running large recommendation or retrieval systems that must handle growing, noisy inventories. Readers working on production ads or similar retrieval stacks would get usable ideas from the framing and the scale of the tests. It deserves a serious referee because the real-world validation and the applied problem are substantial enough to warrant external scrutiny, even with the gaps in methodological detail. I would send it for peer review.

Referee Report

3 major / 1 minor

Summary. The paper claims that ad recommendation systems must shift focus from pure prediction accuracy (recall, NDCG) to stability and predictability as generative AI inflates inventory. It introduces a new evaluation framework for quantifying robustness to minor ad-creative perturbations and proposes an LLM-based retrieval method that extracts hierarchical semantic attributes from creatives, followed by graph-based expansion, to ensure retrieved candidates cover semantic variants. The approach is reported to yield significant gains in both the new stability/predictability metrics and conventional performance metrics, validated through offline experiments and online A/B tests in a large-scale industrial ads system.

Significance. If the central claims hold, the work addresses a practically important gap in large-scale retrieval systems where advertiser-visible instability (repeatability, cold-start, under-exploration) has become acute with generative content. The online A/B validation in a live production system and the attempt to define a dedicated stability framework constitute genuine strengths. However, the absence of formal metric definitions, method ablations, and graph-construction details substantially weakens the ability to judge whether the reported gains are attributable to the proposed LLM-graph pipeline or to other factors.

major comments (3)

[Abstract] Stability and predictability are introduced only intuitively as 'robustness with respect to minor/noisy input (ads, creatives) perturbations,' with no formal metric, formula, or independent quantification procedure supplied. Because the central claim is improvement along these newly defined axes, the lack of a reproducible definition is load-bearing.
[Abstract] The description of the core technical contribution ('extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion') provides neither the attribute hierarchy construction process, the fine-tuning objective, nor graph-construction details (edge types, expansion depth, or similarity function). These omissions prevent verification that small creative variants map to overlapping candidate sets.
[Abstract] No ablation isolating the LLM-graph expansion step from standard embedding-based retrieval is reported. Without such controls it is impossible to attribute the claimed 'fundamental improvement in semantic-awareness' to the proposed method rather than to other system changes or data differences.

minor comments (1)

[Abstract] The abstract states 'significant improvements' without naming the concrete offline metrics, effect sizes, or statistical significance thresholds used in the A/B tests.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where greater precision and reproducibility are needed, particularly in the abstract's presentation of the evaluation framework and method. We have revised the manuscript to address each point while preserving the core contributions and experimental results.

Referee: [Abstract] Stability and predictability are introduced only intuitively as 'robustness with respect to minor/noisy input (ads, creatives) perturbations,' with no formal metric, formula, or independent quantification procedure supplied. Because the central claim is improvement along these newly defined axes, the lack of a reproducible definition is load-bearing.

Authors: We agree that the abstract's intuitive phrasing is insufficient given the centrality of these metrics. The full manuscript (Section 3) already contains formal definitions, including explicit formulas for stability (robustness to creative perturbations measured via overlap in retrieved candidate sets) and predictability (consistency of delivery outcomes under minor input changes). To ensure the abstract is self-contained, we have added a concise formal statement of both metrics and their quantification procedure in the revised abstract.

Revision: yes

Referee: [Abstract] The description of the core technical contribution ('extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion') provides neither the attribute hierarchy construction process, the fine-tuning objective, nor graph-construction details (edge types, expansion depth, or similarity function). These omissions prevent verification that small creative variants map to overlapping candidate sets.

Authors: We acknowledge that the abstract-level description is too high-level for full reproducibility. The revised manuscript now includes expanded details in the Methods section: the hierarchical attribute construction process (top-down extraction of category, intent, and variant attributes), the fine-tuning objective (contrastive loss on semantic equivalence pairs), and graph-construction specifics (directed edges for attribute-to-creative mappings, expansion depth of 2, and cosine similarity threshold of 0.85). These additions directly illustrate how minor creative variants produce overlapping candidate sets.

Revision: yes

Referee: [Abstract] No ablation isolating the LLM-graph expansion step from standard embedding-based retrieval is reported. Without such controls it is impossible to attribute the claimed 'fundamental improvement in semantic-awareness' to the proposed method rather than to other system changes or data differences.

Authors: The referee is correct that a dedicated ablation isolating the graph-expansion component strengthens causal attribution. While the original experiments compared the full pipeline against standard embedding baselines, we have added a new ablation study in the revised Experiments section. This study holds the LLM representation fixed and varies only the presence of graph-based expansion, demonstrating incremental gains in both stability metrics and semantic coverage attributable to the expansion step.

Revision: yes
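
The graph-construction specifics named in the second response (expansion depth of 2, cosine similarity threshold of 0.85) can be illustrated with a short sketch. This is a simplification under stated assumptions: it links ads directly by embedding similarity with undirected edges, whereas the response describes directed attribute-to-creative edges, and the 2-D toy embeddings, `build_graph`, and `expand` are hypothetical names introduced here.

```python
import math
from collections import deque


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(embeddings: dict, threshold: float = 0.85) -> dict:
    """Connect every pair of ads whose cosine similarity meets the threshold."""
    ids = list(embeddings)
    graph = {ad_id: set() for ad_id in ids}
    for index, a in enumerate(ids):
        for b in ids[index + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def expand(graph: dict, seed: str, max_depth: int = 2) -> set:
    """Breadth-first expansion: all ads within `max_depth` hops of the seed."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    seen.discard(seed)
    return seen


# Toy unit vectors spaced 25 degrees apart, so only adjacent ads are linked.
angles = [("a", 0), ("b", 25), ("c", 50), ("d", 75), ("e", 100)]
embeddings = {
    name: (math.cos(math.radians(deg)), math.sin(math.radians(deg)))
    for name, deg in angles
}
graph = build_graph(embeddings)
print(sorted(expand(graph, "a", max_depth=2)))  # ['b', 'c']
```

The depth limit is what keeps expansion from drifting: in the toy chain, ad "a" reaches "c" through "b" even though "a" and "c" fall below the threshold themselves, but "d" and "e" stay out of reach.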

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external A/B validation

The paper introduces an evaluation framework for stability and predictability defined intuitively as robustness to input perturbations, then reports gains from an LLM-based retrieval method via offline metrics and live online A/B tests. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the central claims to inputs by construction. The online validation supplies independent grounding outside the authors' definitions, keeping the derivation self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities are identifiable without access to the full methods and equations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion, ensuring the retrieved candidates encapsulate semantic variants of an ad"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "StatSigDiff metric quantifying A/A' predictability under minor creative perturbations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1] Maxim Naumov et al. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091. https://arxiv.org/pdf/1906.00091.pdf

[2] Carolina Zheng et al. 2025. Enhancing embedding representation stability in recommendation systems with semantic id. In Proceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25). Association for Computing Machinery, 954–957. ISBN: 9798400713644. doi:10.1145/3705328.3748123

[3] Sein Kim, Hongseok Kang, Seungyoon Choi, Donghyun Kim, Minchul Yang, and Chanyoung Park. 2024. Large language models meet collaborative filtering: an efficient all-round llm-based recommender system. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1395–1406

[4] Hanjia Lyu et al. 2023. Llm-rec: personalized recommendation via prompting large language models. arXiv preprint arXiv:2307.15780

[5] Arpita Vats, Vinija Jain, Rahul Raja, and Aman Chadha. 2024. Exploring the impact of large language models on recommender systems: an extensive review. arXiv preprint arXiv:2402.18590

[6] Yashar Deldjoo et al. 2024. Recommendation with generative models. arXiv preprint arXiv:2409.15173

[7] Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. Recommendation as instruction following: a large language model empowered recommendation approach. (2023). https://arxiv.org/abs/2305.07001 arXiv: 2305.07001 [cs.IR]

[8] Fan Yang, Zheng Chen, Ziyan Jiang, Eunah Cho, Xiaojiang Huang, and Yanbin Lu. 2023. Palr: personalization aware llms for recommendation. (2023). https://arxiv.org/abs/2305.07622 arXiv: 2305.07622 [cs.IR]

[9] Hugo Touvron et al. 2023. Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971

[10] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys ’16). Association for Computing Machinery, Boston, Massachusetts, USA, 191–198. ISBN: 9781450340359. doi:10.1145/2959100.2959190

[11] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. Tallrec: an effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems (RecSys ’23). Association for Computing Machinery, Singapore, Singapore, 1007–1014. ISBN: 9798400702419. doi:10.1145/3604915.3608857

[13] AI@Meta Llama Team. 2024. The llama 3 herd of models. https://llama.meta.com (to appear on arXiv). (July 2024)

LLM Retrieval for Stable and Predictable Ad Recommendations. SIGIR Workshop AgentSearch, July 24, 2026, Melbourne, Australia.
