Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit

Keller Maloney; Noah Lehman; Sarah Xu; Will Jack

arxiv: 2605.27439 · v1 · pith:H2E57U43new · submitted 2026-05-22 · 💻 cs.IR · cs.AI

Pith reviewed 2026-06-30 14:49 UTC · model grok-4.3

keywords: AI recommendations, brand prominence, retrieval-augmented generation, commercial queries, failure modes, model audit, persona effects, tier stratification

The pith

AI commercial recommendation failure modes differ sharply by brand prominence tier.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits roughly 37,000 production runs of four AI model configurations on 215 commercial prompts and finds that outcomes split along a five-tier prominence ladder rather than behaving uniformly. L1 brands appear in nearly every relevant retrieval but win only 25-41 percent of the recommendation slots they reach. L2 brands post the highest conversion rates (37-52 percent) yet lose ground to persona-mediated substitution on the Anthropic models. L3 marks an inflection, with coverage falling to 88 percent and persona effects peaking, while 48-52 percent of L4 and L5 brands never surface in any run. The practical upshot is that marketing to AI assistants calls for different investments depending on where a brand sits on the ladder.

Core claim

In retrieval-augmented commercial recommendations, the failure mode is tier-specific: L1 brands appear in nearly every relevant retrieval but win only 25-41 percent of the recommendation slots they reach; L2 challengers post the highest conversion rates (37-52 percent) yet lose to persona-mediated substitution on Anthropic models; L3 mid-market brands form the inflection level with aggregate coverage at 88 percent, conversion at 34-40 percent, and peak persona effects; L4 specialists and L5 regional players face catastrophic invisibility with 48-52 percent never surfacing in any of the 37,000 runs. No uniform optimization recipe succeeds across tiers.
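One way to make the quantities in this claim concrete is sketched below. The record schema, the function name `tier_metrics`, and the metric definitions (conversion as wins per retrieval, invisibility as the share of a tier's brands that never surface) are assumptions for illustration; this summary does not state the paper's exact formulas.

```python
from collections import defaultdict

# Hypothetical run record: (brand, tier, retrieved, recommended).
# Schema and metric definitions are assumptions, not taken from the paper.
def tier_metrics(records):
    """Return {tier: {"conversion": float or None, "invisible_share": float}}."""
    reached = defaultdict(int)    # retrieval events per tier
    won = defaultdict(int)        # retrieval events that became recommendations
    brands = defaultdict(set)     # every brand observed in the records, per tier
    surfaced = defaultdict(set)   # brands retrieved or recommended at least once
    for brand, tier, retrieved, recommended in records:
        brands[tier].add(brand)
        if retrieved:
            reached[tier] += 1
            if recommended:
                won[tier] += 1
        if retrieved or recommended:
            surfaced[tier].add(brand)
    out = {}
    for tier in brands:
        conversion = won[tier] / reached[tier] if reached[tier] else None
        invisible = 1 - len(surfaced[tier]) / len(brands[tier])
        out[tier] = {"conversion": conversion, "invisible_share": invisible}
    return out

# Toy data only: one L1 brand retrieved four times and recommended once,
# two L5 brands of which one never surfaces.
toy = [
    ("A", "L1", True, True), ("A", "L1", True, False),
    ("A", "L1", True, False), ("A", "L1", True, False),
    ("Y", "L5", True, True), ("Z", "L5", False, False),
]
metrics = tier_metrics(toy)
```

Under these definitions a tier can combine near-total visibility with low conversion (the L1 pattern) or high conversion with high invisibility (the toy L5 rows), which is why the two numbers are reported separately.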

What carries the argument

The five-tier prominence ladder (L1 category leaders to L5 regional players) that stratifies the 533-brand catalog and exposes differentiated retrieval and conversion rates.

If this is right

L1 brands must prioritize differentiation over visibility to convert retrieved appearances into wins.
L2 brands must counter persona-mediated substitution to retain their high conversion rates.
L3 brands sit at the inflection where both coverage and persona effects require simultaneous attention.
L4 and L5 brands confront fundamental invisibility that no standard optimization appears to overcome.
Marketing investment for AI assistants must be chosen according to the brand's position on the prominence ladder.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same tier stratification might reveal analogous visibility and substitution patterns in other AI-mediated commercial decisions such as product comparison or pricing advice.
If real-world user prompts differ systematically from the audited set, the reported inflection at L3 could shift to a different tier.
Low-tier brands might test whether niche content strategies can reduce the 48-52 percent invisibility rate observed here.

Load-bearing premise

The 215 commercially-framed prompts and the 533-brand catalog stratified from external authority lists are representative of actual user commercial queries and brand awareness footprints.

What would settle it

A replication using a different prompt set drawn from real user query logs that produces uniform conversion and visibility rates across all five tiers would falsify the tier-specific failure claim.

Original abstract

AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating brands rather than returning a list of links. Marketing to AI is therefore a broader problem than "show up in search" -- positioning, content, and product fit matter as much as discoverability. We audit ~37,000 production runs across four model configurations and 215 commercially-framed prompts spanning 19 sectors, evaluated against a 533-brand reference catalog stratified into five prominence tiers (L1 category leaders to L5 regional players) sourced from external authority lists. The ladder proxies a brand's awareness footprint within its sector, not revenue or market share. The failure mode differs sharply by tier. L1 brands appear in nearly every relevant retrieval but win only 25-41% of the recommendation slots they reach -- the leverage is differentiation, not visibility. L2 challengers carry the highest conversion rates of any tier (37-52%) but lose to persona-mediated substitution on the Anthropic models. L3 mid-market brands are the inflection level: aggregate coverage drops to 88%, conversion to 34-40%, and persona effects peak. L4 specialists and L5 regional players face catastrophic invisibility -- 48-52% never surface in any of the 37,000 runs. No uniform optimization recipe wins; the right marketing investment depends on where the brand sits on the prominence ladder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's scale and tier stratification deliver a concrete empirical map of AI recommendation failures, but the unvalidated prompts and brand lists make the tier differences hard to generalize.

The main thing to know is that this audit maps out how AI models handle commercial brand recommendations across 37,000 runs, with clear differences by prominence tier: L1 brands get seen but win few slots, L2 converts well but gets substituted on some models, L3 sits at an inflection point, and L4/L5 brands often never appear at all.

What is new is the explicit five-tier stratification drawn from external authority lists, applied to a 533-brand catalog and 215 prompts across 19 sectors. The numbers on coverage and conversion rates per tier, plus the model-specific patterns like persona effects on Anthropic, give a more granular picture than prior search or recsys audits.

The work is straightforward about its scope and reports the tier differences directly from the runs. That empirical volume is the real contribution here.

The soft spot is the one flagged above as the load-bearing premise: the 215 prompts and tier definitions are treated as representative without any described validation against real query logs, user studies, or alternative lists. If those choices overweight certain sectors or misalign with actual awareness, the reported failure rates become specific to this setup rather than a general structure. The abstract also skips details on prompt construction, error bars, and statistical controls, which makes the central claims harder to evaluate.

This is for people working on commercial IR, AI-mediated recommendation, or marketing strategy around language models. A reader who wants large-scale data on how prominence affects outcomes will find the tier breakdowns useful.

It deserves peer review because the scale and stratification are substantive enough to warrant referee time, even with the methodology questions that will come up.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical audit of ~37,000 production runs of four AI model configurations on 215 commercially-framed prompts across 19 sectors, evaluated against a 533-brand catalog stratified into five prominence tiers (L1 category leaders to L5 regional players) drawn from external authority lists. It claims that recommendation failure modes differ sharply by tier: L1 brands achieve near-universal retrieval but only 25-41% win rates; L2 challengers show the highest conversion (37-52%) but suffer persona-mediated substitution on Anthropic models; L3 is an inflection point with 88% coverage and 34-40% conversion; L4/L5 brands exhibit 48-52% total invisibility. No uniform optimization strategy works; strategy must be tier-dependent.

Significance. If the tier-stratified patterns are robust, the work provides a large-scale empirical map of how prominence interacts with AI recommendation mechanics, with direct implications for marketing investment allocation. The scale (37k runs) and external catalog grounding are strengths; the absence of parameter fitting or self-referential derivations supports the audit framing.

major comments (2)

[Methodology / prompt and catalog construction] Methodology (prompt construction and tier stratification sections): the central tier-specific rates (L1 25-41% win, L2 37-52% conversion, L3 88%/34-40%, L4/L5 48-52% invisibility) rest on a fixed set of 215 prompts and external-list tiering with no described validation against query logs, user studies, or alternative stratifications; if the prompt distribution or tier definitions are unrepresentative, all reported differences become sample artifacts rather than general structure.
[Results / aggregate statistics] Results sections reporting aggregate statistics: no information is provided on statistical testing, error bars, confidence intervals, or controls for model-specific artifacts and prompt variability, preventing evaluation of whether the reported tier differences exceed sampling noise.

minor comments (2)

[Abstract / Methods] Clarify in the abstract and methods whether the 215 prompts were manually authored or derived from templates, and list the exact external authority sources used for the 533-brand tiering.
[Appendix or supplementary material] Add a table or appendix showing per-sector prompt counts and brand distribution across tiers to allow assessment of balance.
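The balance table requested in the second minor comment could be produced with a few lines of code. The sketch below assumes a hypothetical catalog of `(brand, sector, tier)` tuples; the function name `balance_table` and the sector names in the usage lines are illustrative and do not come from the paper.

```python
from collections import Counter

TIERS = ("L1", "L2", "L3", "L4", "L5")

def balance_table(catalog):
    """catalog: list of (brand, sector, tier) tuples (hypothetical schema).

    Returns a plain-text table with one row per sector and one column per tier.
    """
    counts = Counter((sector, tier) for _, sector, tier in catalog)
    sectors = sorted({sector for _, sector, _ in catalog})
    lines = ["sector".ljust(12) + "".join(t.rjust(4) for t in TIERS)]
    for sector in sectors:
        cells = "".join(str(counts[(sector, t)]).rjust(4) for t in TIERS)
        lines.append(sector.ljust(12) + cells)
    return "\n".join(lines)

# Toy catalog only; not the paper's 533 brands.
print(balance_table([
    ("a", "travel", "L1"), ("b", "travel", "L5"), ("c", "fintech", "L1"),
]))
```

Empty cells in such a table would show directly which sector and tier combinations contribute nothing to the aggregate rates.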

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on methodology and statistical reporting. We address each major point below and outline revisions to strengthen the audit's transparency and rigor while preserving its empirical framing.

Referee: [Methodology / prompt and catalog construction] Methodology (prompt construction and tier stratification sections): the central tier-specific rates (L1 25-41% win, L2 37-52% conversion, L3 88%/34-40%, L4/L5 48-52% invisibility) rest on a fixed set of 215 prompts and external-list tiering with no described validation against query logs, user studies, or alternative stratifications; if the prompt distribution or tier definitions are unrepresentative, all reported differences become sample artifacts rather than general structure.

Authors: We agree that explicit validation steps would strengthen generalizability claims. The 215 prompts were constructed to span 19 sectors with commercial framing (e.g., purchase-intent phrasing) drawn from common consumer query patterns, and tiers were assigned via external authority lists (Forbes, Interbrand, sector-specific rankings) to avoid circularity. However, the manuscript does not report cross-validation against real query logs or alternative tierings. In revision we will expand the prompt-construction subsection with explicit rationale and examples, add a dedicated limitations paragraph acknowledging the absence of log-based or user-study validation, and note that the audit is intended as a large-scale empirical map rather than a statistically representative sample of all possible queries.
Revision: partial

Referee: [Results / aggregate statistics] Results sections reporting aggregate statistics: no information is provided on statistical testing, error bars, confidence intervals, or controls for model-specific artifacts and prompt variability, preventing evaluation of whether the reported tier differences exceed sampling noise.

Authors: The referee correctly identifies a gap in the current reporting. While the 37,000-run scale provides descriptive stability, the manuscript presents raw percentages without inferential statistics or uncertainty quantification. In the revised version we will add (1) bootstrap-derived 95% confidence intervals for all tier-level rates, (2) pairwise chi-square or Fisher's exact tests (with multiplicity correction) comparing conversion and invisibility rates across tiers, and (3) a brief analysis of prompt-level variance and model-specific effects to demonstrate that the reported tier patterns are not artifacts of single-prompt or single-model noise.
Revision: yes
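The statistical additions promised in this response could look roughly like the standard-library sketch below. It is not the authors' analysis: the function names and the counts in the usage lines are hypothetical, Holm's step-down procedure stands in for the unspecified "multiplicity correction", and the bootstrap treats runs as independent, which may understate uncertainty when runs cluster by prompt.

```python
import math
import random

def bootstrap_ci(successes, n, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion, resampling n binary outcomes."""
    rng = random.Random(seed)
    data = [1] * successes + [0] * (n - successes)
    rates = sorted(sum(rng.choices(data, k=n)) / n for _ in range(reps))
    lo = rates[int((alpha / 2) * reps)]
    hi = rates[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

def two_proportion_p(s1, n1, s2, n2):
    """Pearson chi-square p-value for a 2x2 table (df = 1, no continuity correction).

    Assumes both outcome columns are non-empty so expected counts are positive.
    """
    table = [[s1, n1 - s1], [s2, n2 - s2]]
    total = n1 + n2
    row = [n1, n2]
    col = [s1 + s2, total - s1 - s2]
    chi2 = sum(
        (table[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
        for i in range(2) for j in range(2)
    )
    # Survival function of the chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted, running = [0.0] * len(pvalues), 0.0
    for rank, i in enumerate(order):
        running = max(running, (len(pvalues) - rank) * pvalues[i])
        adjusted[i] = min(1.0, running)
    return adjusted

# Hypothetical counts: 300 wins in 1,000 reached slots versus 450 in 1,000.
low, high = bootstrap_ci(300, 1000)
p_adjusted = holm([two_proportion_p(300, 1000, 450, 1000)])
```

A bootstrap that resamples whole prompts rather than individual runs would be the natural variant for item (3), since it keeps within-prompt correlation intact.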

Circularity Check

0 steps flagged

No circularity: purely empirical counts from fixed external inputs

The paper reports direct empirical measurements (appearance rates, conversion rates, coverage) computed from 37,000 model runs on a fixed set of 215 prompts and a 533-brand catalog drawn from external authority lists. No equations, fitted parameters, predictions derived from the data, or self-citations appear in the provided text. Tier definitions and prompt framing are treated as inputs, not outputs of the analysis. The skeptic's concern is about the external validity of the sample, not about claims reducing to the paper's own definitions or fits. This is a standard non-circular empirical audit.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the external brand lists providing a valid proxy for awareness and on the chosen prompts representing commercial queries; no free parameters are fitted inside the study itself.

axioms (2)

Domain assumption: Prominence tiers sourced from external authority lists accurately proxy a brand's awareness footprint within its sector.
This premise is invoked to interpret the differing failure modes across L1-L5 and is stated in the abstract description of the reference catalog.
Domain assumption: The 215 prompts and four model configurations capture the relevant variation in real-world AI commercial recommendation behavior.
Used to generalize the observed tier-specific patterns from the 37,000 runs.

Reference graph

Works this paper leans on

17 extracted references · 6 canonical work pages

[1] Jack, W. Divergent Recommendations, Convergent Diagnoses: Cross-Provider Failure-Mode Convergence in AI Commercial Recommendation. Unusual.ai Research Series, 2026a.
[2] Jack, W. Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline. Unusual.ai Research Series, 2026b.
[3] Jack, W. Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit. Unusual.ai Research Series, 2026c.
[4] Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. KDD ’24, 2024. arXiv:2311.09735.
[5] Andre, A., Roy, G., Dyer, E., Wang, K. Revealing Potential Biases in LLM-Based Recommender Systems in the Cold Start Setting. arXiv:2508.20401, 2025.
[6] Klimashevskaia, A., Jannach, D., Elahi, M., Trattner, C. A Survey on Popularity Bias in Recommender Systems. User Modeling and User-Adapted Interaction, 2024. arXiv:2308.01118.
[7] Chen, M., Wang, X., Chen, K., Koudas, N. Generative Engine Optimization: How to Dominate AI Search. arXiv:2509.08919, 2025.
[8] Court, D., Elzinga, D., Mulder, S., Vetvik, O. J. The Consumer Decision Journey. McKinsey Quarterly, 2009.
[9] Fishkin, R. NEW Research: AIs are highly inconsistent when recommending brands or products; marketers should take care when tracking AI visibility. SparkToro, 2026.
[10] Keller, K. L. Conceptualizing, Measuring, and Managing Customer-Based Brand Equity. Journal of Marketing, 1993.
[11] Lewis, E. S. AIDA model. Foundational concept, 1898.
[12] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL, 2023.
[13] Qin, J., Liu, M., Liu, X. A Survey of Long-Tail Item Recommendation Methods. Wireless Communications and Mobile Computing, 2021.
[14] Rienecker, J., Mpofu, K., Goel, N., Datta, S., Zhao, J., Danielsson, O., Thorsen, F. Auditing Preferences for Brands and Cultures in LLMs. arXiv:2603.18300, 2026.
[15] Lichtenberg, J. M., Buchholz, A., Schwöbel, P. Large Language Models as Recommender Systems: A Study of Popularity Bias. Gen-IR Workshop at SIGIR, 2024. arXiv:2406.01285.
[16] Blake, C., Scharf, A. 87% of SearchGPT Citations Match Bing’s Top Results. Seer Interactive, 2025.
[17] Yin, H., Cui, B., Li, J., Yao, J., Chen, C. Challenging the Long Tail Recommendation. VLDB, 2012.
