
arxiv: 2605.30207 · v1 · pith:53KHZKKKnew · submitted 2026-05-28 · 💻 cs.AI

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

Pith reviewed 2026-06-29 07:18 UTC · model grok-4.3

classification 💻 cs.AI
keywords persona conditioning · brand recommendations · retrieval-augmented generation · AI chatbots · Jaccard similarity · prominence stratification · commercial queries

The pith

Prefixing the same query with different buyer personas drops AI recommendation-set overlap by 0.12-0.20 in Jaccard index.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether buyer context, signaled by a persona prefix, changes which brands commercial AI chatbots recommend for identical prompts. Across 2000 runs on OpenAI and Anthropic models, it measures recommendation-set similarity and finds consistent drops when personas differ. The effect concentrates on mid-market brands while category leaders remain stable. The audit shows that aggregating recommendations without conditioning on persona masks real variation in model output.

Core claim

The same prompt produces materially different recommendation sets depending on who the model thinks is asking, with the effect sharply stratified by brand prominence and largest on the most priors-reliant generation route.

What carries the argument

Jaccard similarity computed on persona-conditioned recommendation sets, stratified by brand prominence category.
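As a minimal sketch of this measure, the snippet below computes mean pairwise Jaccard similarity for a same-persona baseline and a cross-persona comparison, then repeats the cross-persona comparison within two prominence tiers. The brand names, the tier assignment, the `restrict` helper, and the choice of averaging over all unordered pairs are illustrative assumptions; the source does not specify how the paper builds its pairs or tiers.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(sets):
    """Average Jaccard index over all unordered pairs of recommendation sets."""
    pairs = list(combinations(sets, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


def restrict(sets, tier):
    """Keep only the brands that belong to one prominence tier (hypothetical helper)."""
    return [s & tier for s in sets]


# Hypothetical data: brand names and tier membership are placeholders.
same_persona = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "D"}]   # reruns, one persona
cross_persona = [{"A", "B", "C"}, {"A", "D", "E"}, {"A", "B", "E"}]  # one run per persona
leaders, mid_market = {"A"}, {"B", "C", "D", "E"}

baseline = mean_pairwise_jaccard(same_persona)
cross = mean_pairwise_jaccard(cross_persona)
delta = cross - baseline  # negative when personas pull the sets apart

leader_overlap = mean_pairwise_jaccard(restrict(cross_persona, leaders))
mid_overlap = mean_pairwise_jaccard(restrict(cross_persona, mid_market))
print(f"delta={delta:.3f} leaders={leader_overlap:.3f} mid={mid_overlap:.3f}")
```

In this toy data the leader brand appears in every set while the mid-market brands rotate, which is the qualitative pattern the paper reports; the numbers themselves carry no empirical meaning.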

Load-bearing premise

The ten chosen personas accurately represent distinct buyer contexts and models treat the prefixes only as identity signals.

What would settle it

Repeating the audit with a fresh set of personas and finding all clustered confidence intervals for the Jaccard delta include zero.

Original abstract

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
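The run count in the abstract can be checked with a few lines of arithmetic. This sketch assumes one run per persona × prompt × repetition within each cell; the cell labels are shorthand chosen here, not identifiers from the paper.

```python
# Design-space arithmetic for the audit described in the abstract.
# Assumption: each (persona, prompt, rep) combination in a cell is one run.
personas = 10
reps = 10

# Prompt coverage per model configuration, as stated in the abstract.
prompts_per_cell = {
    "openai/high": 8,   # full 8-prompt coverage
    "openai/low": 8,    # full 8-prompt coverage
    "sonnet-4.6/low": 4,  # 4-prompt coverage
}

runs_per_cell = {cell: personas * n * reps for cell, n in prompts_per_cell.items()}
total_runs = sum(runs_per_cell.values())
full_grid = personas * 8 * len(prompts_per_cell) * reps

print(runs_per_cell)
print("sampled:", total_runs, "of full grid:", full_grid)
```

Under that assumption the two OpenAI cells contribute 800 runs each and the sonnet cell 400, which matches the 2,000 runs reported, against 2,400 for the full grid.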

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper audits how buyer personas affect brand recommendations in retrieval-augmented commercial chat systems. It samples 2000 runs across 10 personas × 8 prompts × 3 model configurations (OpenAI high/low and Anthropic sonnet-4.6/low, with partial coverage in one cell) and reports that persona prefixes reduce recommendation-set Jaccard similarity by Δ = -0.12 to -0.20 relative to same-persona baselines (clustered 95% CIs exclude zero). The effect is prominence-stratified (category leaders ~80% consistent; mid-market brands swap up to 75%), larger in the Anthropic configuration, and attributed to differences in retrieval attribution rates.

Significance. If the central empirical deltas hold after controls for prompt artifacts, the result shows that measurements of AI brand perception must condition on buyer persona, because aggregated protocols obscure material variation concentrated at mid-market brands. The prominence stratification and cross-provider comparison (with a note on retrieval-unattributed generation) provide a concrete, falsifiable demonstration that context integration strength modulates recommendation stability. The clustered-CI design and explicit coverage limitations are strengths that improve audit transparency.

major comments (2)
  1. [Abstract (design space description and effect attribution)] The design (10 personas × 8 prompts) does not report controls or ablations for systematic differences in prefix length, lexical overlap with the query, or syntactic framing across the 10 personas. Without such matching, the observed Jaccard drops cannot be securely attributed to buyer-identity conditioning rather than prompt-phrasing confounds that could alter retrieval scores or generation priors independently of the intended persona signal.
  2. [Abstract (asymmetry paragraph)] The interpretation that the Anthropic vs. OpenAI asymmetry is consistent with retrieval-unattributed generation (43-52% vs. 8-29%) rests on rates documented in Jack 2026. While the empirical deltas are independent measurements, the explanatory claim for why the effect is larger on one route reduces in part to that prior result; direct within-study attribution measurements or clearer separation of the descriptive claim from the causal interpretation would strengthen the argument.
minor comments (2)
  1. [Abstract] The sonnet cell's CI is noted as resting on only 4 prompt clusters; a table or appendix explicitly listing per-cell prompt coverage and cluster counts would improve reproducibility.
  2. [Abstract] The abstract states N=10 reps but does not specify whether the clustered CIs account for prompt-level or persona-level clustering; a brief methods note on the clustering structure would clarify the statistical procedure.
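One way the clustering question in minor comment 2 could be made concrete is a percentile cluster bootstrap that resamples whole prompts. The sketch below is an assumption about a plausible procedure, not the paper's documented method; the function name, the per-prompt delta values, and the choice of prompt-level clusters are all hypothetical.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(deltas_by_cluster, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean delta, resampling whole clusters with replacement."""
    rng = random.Random(seed)
    clusters = list(deltas_by_cluster)
    estimates = []
    for _ in range(n_boot):
        # Draw as many clusters as exist, with replacement, and pool their deltas.
        sample = [rng.choice(clusters) for _ in clusters]
        pooled = [d for c in sample for d in deltas_by_cluster[c]]
        estimates.append(mean(pooled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical Jaccard deltas grouped by prompt (the cluster unit assumed here).
deltas = {
    "prompt_1": [-0.10, -0.14],
    "prompt_2": [-0.20, -0.18],
    "prompt_3": [-0.12],
    "prompt_4": [-0.16, -0.15],
}
lower, upper = cluster_bootstrap_ci(deltas)
print(f"95% CI for mean delta: [{lower:.3f}, {upper:.3f}]")
```

Resampling at the cluster level keeps runs that share a prompt together, so an interval built on 4 prompt clusters draws on far fewer independent units than one built on 8, which is in line with the wider sonnet interval the abstract mentions.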

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on attribution and design controls. We address each point below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract (design space description and effect attribution)] The design (10 personas × 8 prompts) does not report controls or ablations for systematic differences in prefix length, lexical overlap with the query, or syntactic framing across the 10 personas. Without such matching, the observed Jaccard drops cannot be securely attributed to buyer-identity conditioning rather than prompt-phrasing confounds that could alter retrieval scores or generation priors independently of the intended persona signal.

    Authors: We agree the manuscript does not report quantitative controls or ablations for prefix length, lexical overlap, or syntactic framing. Personas were constructed to vary primarily on buyer identity with fixed core queries, but without explicit matching this leaves room for prompt artifacts. In the revised version we will add (i) summary statistics on prefix lengths and token overlap across the 10 personas and (ii) a sensitivity check re-running a subset of prompts with length-normalized prefixes to test robustness of the reported Jaccard deltas. revision: yes

  2. Referee: [Abstract (asymmetry paragraph)] The interpretation that the Anthropic vs. OpenAI asymmetry is consistent with retrieval-unattributed generation (43-52% vs. 8-29%) rests on rates documented in Jack 2026. While the empirical deltas are independent measurements, the explanatory claim for why the effect is larger on one route reduces in part to that prior result; direct within-study attribution measurements or clearer separation of the descriptive claim from the causal interpretation would strengthen the argument.

    Authors: The Jaccard deltas and the within-study retrieval-attribution percentages we report are measured directly in our runs and do not depend on Jack 2026. The asymmetry is described as consistent with rather than proven by the cited rates. We will revise the abstract and discussion to separate the descriptive finding (larger point estimate on the Anthropic route) from the interpretive discussion, and will explicitly note that a stronger causal link would require additional within-study experiments not present in the current audit. revision: partial
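The summary statistics promised in the first response (prefix length and token overlap with the query) could be computed along the following lines. This is a rough sketch under stated assumptions: whitespace tokenization stands in for a real model tokenizer, and the persona prefixes and the `prefix_stats` helper are invented for illustration rather than taken from the paper.

```python
def tokens(text):
    """Crude whitespace tokenizer; a real check would use the model's tokenizer."""
    return text.lower().split()


def prefix_stats(prefixes, query):
    """Per-persona prefix length and share of distinct prefix tokens found in the query."""
    query_tokens = set(tokens(query))
    rows = {}
    for name, prefix in prefixes.items():
        toks = tokens(prefix)
        distinct = set(toks)
        overlap = len(distinct & query_tokens) / len(distinct) if distinct else 0.0
        rows[name] = {"n_tokens": len(toks), "query_overlap": overlap}
    return rows


# Hypothetical persona prefixes; the paper's actual wording is not given in the source.
prefixes = {
    "solo_founder": "I am a solo founder.",
    "uk_smb_owner": "I run a small UK software business.",
}
stats = prefix_stats(prefixes, "best CRM software")
for name, row in stats.items():
    print(name, row)
```

Large spreads in either column across personas would point to the phrasing confound raised in major comment 1, and would motivate the length-normalized rerun the authors describe.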

Circularity Check

1 steps flagged

Self-citation load-bearing only for asymmetry explanation; core deltas independent

specific steps
  1. self citation load bearing [Abstract]
    "the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026)"

    The paper's interpretation of the larger Anthropic effect size is justified by citing retrieval attribution statistics from prior work by the lead author (Jack 2026). While the Jaccard deltas themselves are independent observations from the current runs, the explanatory claim for the provider asymmetry reduces to this self-cited result.

full rationale

The paper reports direct experimental measurements of Jaccard drops under persona prefixes across 2000 runs, with no equations, parameter fits, or derivations that reduce to inputs. The sole load-bearing self-citation appears in the abstract, where it explains the Anthropic/OpenAI point-estimate difference via retrieval attribution rates from Jack 2026. This does not affect the validity of the measured deltas or the prominence-stratified pattern, which rest on the current audit's data. The case matches the pattern of a self-citation in which the central empirical claim retains independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.1-grok · 5899 in / 1257 out tokens · 31962 ms · 2026-06-29T07:18:00.886301+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 16 canonical work pages · 2 internal anchors

  1. Jack, W. Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit. Unusual.ai Research Series, 2026a

  2. Jack, W. Divergent Recommendations, Convergent Diagnoses: Cross-Provider Failure-Mode Convergence in AI Commercial Recommendation. Unusual.ai Research Series, 2026b

  3. Jack, W. Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline. Unusual.ai Research Series, 2026c

  4. Adomavicius, G., Tuzhilin, A. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005

  5. Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. KDD ’24, 2024. arXiv:2311.09735

  6. Bai, X., Wang, A., Sucholutsky, I., Griffiths, T. L. Explicitly Unbiased Large Language Models Still Form Biased Associations. Proceedings of the National Academy of Sciences, 122(8), 2025

  7. Chatterjee, A., Renduchintala, H., Bhatia, S., Chakraborty, T. POSIX: A Prompt Sensitivity Index for Large Language Models. EMNLP Findings, 2024. arXiv:2410.02185

  8. Ge, T., Chan, X., Wang, X., Yu, D., Mi, H., Yu, D. Scaling Synthetic Data Creation with 1,000,000,000 Personas (PersonaHub). arXiv:2406.20094, 2024

  9. Hu, T., Collier, N. Quantifying the Persona Effect in LLM Simulations. ACL, 2024

  10. Kantharuban, A., Milbauer, J., Sap, M., Strubell, E., Neubig, G. Stereotype or Personalization? User Identity Biases Chatbot Recommendations. ACL Findings, 2025. arXiv:2410.05613

  11. Lin, W., Gerchanovsky, A., Akgul, O., Bauer, L., Fredrikson, M., Wang, Z. LLM Whisperer: An Inconspicuous Attack to Bias LLM Responses. CHI, 2025. arXiv:2406.04755

  12. Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., Stanovsky, G. State of What Art? A Call for Multi-Prompt LLM Evaluation. Transactions of the Association for Computational Linguistics (TACL), 2024

  13. Li, L., Sleem, L., Gentile, N., Nichil, G., State, R. Exploring the Impact of Temperature on Large Language Models: Hot or Cold? arXiv:2506.07295, 2025

  14. Hu, Z., Lian, J., Xiao, Z., Xiong, M., Lei, Y., Wang, T., Ding, K., Xiao, Z., Yuan, N. J., Xie, X. Population-Aligned Persona Generation for LLM-based Social Simulation. arXiv:2509.10127, 2025

  15. Lutz, M., Sen, I., Ahnert, G., Rogers, E., Strohmaier, M. The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models. EMNLP Findings, 2025. arXiv:2507.16076

  16. Rienecker, J., Mpofu, K., Goel, N., Datta, S., Zhao, J., Danielsson, O., Thorsen, F. Auditing Preferences for Brands and Cultures in LLMs (ChoiceEval). arXiv:2603.18300, 2026

  17. Lichtenberg, J. M., Buchholz, A., Schwöbel, P. Large Language Models as Recommender Systems: A Study of Popularity Bias. arXiv:2406.01285, 2024

  18. Schaeffer, R., Miranda, B., Koyejo, S. Are Emergent Abilities of Large Language Models a Mirage? NeurIPS, 2023

  19. Sclar, M., Choi, Y., Tsvetkov, Y., Suhr, A. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design. ICLR, 2024. arXiv:2310.11324

  20. Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL, 2023

  21. Wang, Y., Ren, R., Wang, Y., Zhao, W. X., Liu, J., Wu, H., Wang, H. Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation. SIGIR, 2025. arXiv:2505.11995

  22. An, J., Huang, D., Lin, C., Tai, M. Measuring Gender and Racial Biases in Large Language Models: Intersectional Evidence from Automated Resume Evaluation. PNAS Nexus, 4(3), 2025. DOI: 10.1093/pnasnexus/pgaf089

  23. Goyal, S., Baek, C., Kolter, J. Z., Raghunathan, A. Context-Parametric Inversion: Why Instruction Finetuning Can Worsen Context Reliance. ICLR, 2025. arXiv:2410.10796

  24. Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., Xu, W. Knowledge Conflicts for LLMs: A Survey. EMNLP, 2024. arXiv:2403.08319

  25. Xu, K., Potka, S., Thomo, A. Gender and Race Bias in Consumer Product Recommendations by Large Language Models. AINA 2025, Lecture Notes in Networks and Systems vol. 1210, Springer, 2025. arXiv:2602.08124

  26. Zheng, M., Pei, J., Logeswaran, L., Lee, M., Jurgens, D. When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models. EMNLP Findings, 2024