Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

Keller Maloney; Noah Lehman; Sarah Xu; Will Jack

arxiv: 2605.30207 · v1 · pith:53KHZKKKnew · submitted 2026-05-28 · 💻 cs.AI

Pith reviewed 2026-06-29 07:18 UTC · model grok-4.3

keywords persona conditioning, brand recommendations, retrieval-augmented generation, AI chatbots, Jaccard similarity, prominence stratification, commercial queries

The pith

Prefixing the same query with different buyer personas drops AI recommendation-set overlap by 0.12-0.20 in Jaccard index.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether buyer context, signaled by a persona prefix, changes which brands commercial AI chatbots recommend for identical prompts. Across 2000 runs on OpenAI and Anthropic models, it measures recommendation-set similarity and finds consistent drops when personas differ. The effect concentrates on mid-market brands while category leaders remain stable. The audit shows that aggregating recommendations without conditioning on persona masks real variation in model output.

Core claim

The same prompt produces materially different recommendation sets depending on who the model thinks is asking, with the effect sharply stratified by brand prominence and largest on the most priors-reliant generation route.

What carries the argument

Jaccard similarity computed on persona-conditioned recommendation sets, stratified by brand prominence category.
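As a minimal sketch of the measurement, the Jaccard index of two recommendation sets is the size of their intersection divided by the size of their union, and the persona effect is the cross-persona mean minus the same-persona baseline. The function names, the persona labels, and the placeholder brands `A` to `E` below are illustrative assumptions, not the paper's code or data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two brand sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def persona_delta(runs):
    """runs maps persona -> list of recommendation sets from repeated runs of one prompt.

    Returns mean cross-persona Jaccard minus mean same-persona Jaccard,
    so a negative value means persona changes reduce overlap.
    """
    within = [jaccard(x, y)
              for reps in runs.values()
              for x, y in combinations(reps, 2)]
    across = [jaccard(x, y)
              for (_, reps_p), (_, reps_q) in combinations(runs.items(), 2)
              for x in reps_p
              for y in reps_q]
    return sum(across) / len(across) - sum(within) / len(within)


# Hypothetical toy data: two personas, two repetitions each.
toy_runs = {
    "solo_founder": [{"A", "B", "C"}, {"A", "B", "C"}],
    "enterprise_vp": [{"A", "D", "E"}, {"A", "D", "E"}],
}
toy_delta = persona_delta(toy_runs)  # 0.2 across minus 1.0 within
```

In the toy data only brand `A` survives the persona change, which mirrors the reported pattern of stable category leaders and shifting mid-market brands, though the numbers here are made up.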

Load-bearing premise

The ten chosen personas accurately represent distinct buyer contexts and models treat the prefixes only as identity signals.

What would settle it

Repeating the audit with a fresh set of personas and finding all clustered confidence intervals for the Jaccard delta include zero.

Original abstract

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
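The run count in the abstract can be checked with a short calculation. The sketch below assumes the three cells are the two OpenAI configurations at 8 prompts each and the Anthropic sonnet cell at 4 prompts, as the abstract states; the cell labels are illustrative names, not identifiers from the paper.

```python
# Design parameters as stated in the abstract.
personas = 10
reps = 10

# Prompt coverage per model configuration (cell labels are illustrative).
prompts_per_cell = {
    "openai_high": 8,
    "openai_low": 8,
    "anthropic_sonnet_low": 4,
}

# Each cell contributes personas x prompts x reps runs.
runs_per_cell = {cell: personas * n * reps for cell, n in prompts_per_cell.items()}
total_runs = sum(runs_per_cell.values())

# For comparison: the full factorial if every cell had 8 prompts.
full_factorial = personas * 8 * len(prompts_per_cell) * reps
```

Under these assumptions the two OpenAI cells contribute 800 runs each and the sonnet cell 400, which gives the 2,000 sampled runs out of a 2,400-run full design space.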

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Persona prefixes cut brand recommendation-set overlap by 0.12-0.20 Jaccard, with mid-market brands shifting most, but the design leaves prompt-form confounds unaddressed.

The key things to know are that adding a persona prefix drops recommendation-set Jaccard by 0.12 to 0.20 relative to baseline and that the shift concentrates on mid-market brands while category leaders stay stable around 80 percent. The Anthropic setup shows a larger point estimate than the OpenAI ones, tied to higher rates of retrieval-unattributed generation.

The paper applies a clean audit across 10 personas and 8 prompts, then stratifies the results by brand prominence. That breakdown is the useful part: it turns a generic context-sensitivity claim into a more precise statement about where the variation actually occurs. The cross-provider comparison and the link to retrieval attribution rates give the result a bit more grounding than a single-model study would have.

The soft spot is the absence of any check that the 10 persona prefixes were matched on length, lexical overlap with the query, or syntactic structure. Without that control, some of the measured Delta could trace to prompt surface differences rather than the buyer-context signal the prefixes are supposed to carry. The abstract gives no sign of an ablation or balancing step on this, so the attribution to persona interpretation is not fully locked down. The Anthropic cell also rests on only four prompts, which makes that contrast noisier.

This is aimed at researchers who audit commercial AI systems or study context effects in retrieval-augmented generation. It offers a replicable protocol and effect sizes that others could test or extend.

I would send it to peer review. The core pattern is straightforward enough that referees can examine the methods and data once the full manuscript is available.

Referee Report

2 major / 2 minor

Summary. The paper audits how buyer personas affect brand recommendations in retrieval-augmented commercial chat systems. It samples 2000 runs across 10 personas × 8 prompts × 3 model configurations (OpenAI high/low and Anthropic sonnet-4.6/low, with partial coverage in one cell) and reports that persona prefixes reduce recommendation-set Jaccard similarity by Δ = -0.12 to -0.20 relative to same-persona baselines (clustered 95% CIs exclude zero). The effect is prominence-stratified (category leaders ~80% consistent; mid-market brands swap up to 75%), larger in the Anthropic configuration, and attributed to differences in retrieval attribution rates.

Significance. If the central empirical deltas hold after controls for prompt artifacts, the result shows that measurements of AI brand perception must condition on buyer persona, as aggregated protocols obscure material variation concentrated at mid-market brands. The prominence stratification and cross-provider comparison (with note on retrieval-unattributed generation) provide a concrete, falsifiable demonstration that context integration strength modulates recommendation stability. The clustered-CI design and explicit coverage limitations are strengths that improve audit transparency.

major comments (2)

[Abstract (design space description and effect attribution)] The design (10 personas × 8 prompts) does not report controls or ablations for systematic differences in prefix length, lexical overlap with the query, or syntactic framing across the 10 personas. Without such matching, the observed Jaccard drops cannot be securely attributed to buyer-identity conditioning rather than prompt-phrasing confounds that could alter retrieval scores or generation priors independently of the intended persona signal.
[Abstract (asymmetry paragraph)] The interpretation that the Anthropic vs. OpenAI asymmetry is consistent with retrieval-unattributed generation (43-52% vs. 8-29%) rests on rates documented in Jack 2026. While the empirical deltas are independent measurements, the explanatory claim for why the effect is larger on one route reduces in part to that prior result; direct within-study attribution measurements or clearer separation of the descriptive claim from the causal interpretation would strengthen the argument.

minor comments (2)

[Abstract] The sonnet cell's CI is noted as resting on only 4 prompt clusters; a table or appendix explicitly listing per-cell prompt coverage and cluster counts would improve reproducibility.
[Abstract] The abstract states N=10 reps but does not specify whether the clustered CIs account for prompt-level or persona-level clustering; a brief methods note on the clustering structure would clarify the statistical procedure.
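To make the clustering question concrete, one possible procedure is a cluster bootstrap that resamples whole prompts rather than individual runs. This is a sketch under the assumption of prompt-level clustering with a percentile interval; the abstract does not say which procedure the authors used, and the function name and toy deltas are hypothetical.

```python
import random


def cluster_bootstrap_ci(deltas_by_cluster, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean delta, resampling whole clusters with replacement.

    deltas_by_cluster maps a cluster id (here: a prompt) to its list of deltas.
    """
    rng = random.Random(seed)
    clusters = list(deltas_by_cluster.values())
    means = []
    for _ in range(n_boot):
        # Draw as many clusters as observed, keeping each cluster's deltas together.
        sample = [rng.choice(clusters) for _ in clusters]
        values = [d for cluster in sample for d in cluster]
        means.append(sum(values) / len(values))
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-prompt deltas for a cell with only 4 prompt clusters.
toy_deltas = {
    "p1": [-0.10, -0.20],
    "p2": [-0.15, -0.05],
    "p3": [-0.12, -0.18],
    "p4": [-0.08, -0.22],
}
ci_low, ci_high = cluster_bootstrap_ci(toy_deltas)
```

With only four clusters the resampled means can take few distinct values, which is one way to see why an interval resting on 4 prompt clusters is less reliable than one resting on 8.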

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on attribution and design controls. We address each point below and indicate planned revisions.

Referee: [Abstract (design space description and effect attribution)] The design (10 personas × 8 prompts) does not report controls or ablations for systematic differences in prefix length, lexical overlap with the query, or syntactic framing across the 10 personas. Without such matching, the observed Jaccard drops cannot be securely attributed to buyer-identity conditioning rather than prompt-phrasing confounds that could alter retrieval scores or generation priors independently of the intended persona signal.

Authors: We agree the manuscript does not report quantitative controls or ablations for prefix length, lexical overlap, or syntactic framing. Personas were constructed to vary primarily on buyer identity with fixed core queries, but without explicit matching this leaves room for prompt artifacts. In the revised version we will add (i) summary statistics on prefix lengths and token overlap across the 10 personas and (ii) a sensitivity check re-running a subset of prompts with length-normalized prefixes to test robustness of the reported Jaccard deltas. revision: yes
Referee: [Abstract (asymmetry paragraph)] The interpretation that the Anthropic vs. OpenAI asymmetry is consistent with retrieval-unattributed generation (43-52% vs. 8-29%) rests on rates documented in Jack 2026. While the empirical deltas are independent measurements, the explanatory claim for why the effect is larger on one route reduces in part to that prior result; direct within-study attribution measurements or clearer separation of the descriptive claim from the causal interpretation would strengthen the argument.

Authors: The Jaccard deltas and the within-study retrieval-attribution percentages we report are measured directly in our runs and do not depend on Jack 2026. The asymmetry is described as consistent with rather than proven by the cited rates. We will revise the abstract and discussion to separate the descriptive finding (larger point estimate on the Anthropic route) from the interpretive discussion, and will explicitly note that a stronger causal link would require additional within-study experiments not present in the current audit. revision: partial
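The prefix summary statistics proposed in the first response could be computed along these lines. This is only a sketch: the prefix wordings are invented for illustration (the abstract names buyer types but not the prefix text), the tokenizer is a simple lowercase alphanumeric split, and `prefix_stats` is a hypothetical helper.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for a real model tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def prefix_stats(prefixes, query):
    """Per-persona prefix length and share of distinct prefix tokens found in the query."""
    query_tokens = set(tokenize(query))
    stats = {}
    for name, prefix in prefixes.items():
        tokens = tokenize(prefix)
        distinct = set(tokens)
        overlap = len(distinct & query_tokens) / len(distinct) if distinct else 0.0
        stats[name] = {"n_tokens": len(tokens), "query_overlap": overlap}
    return stats


# Hypothetical prefixes; the wording is an assumption, not taken from the paper.
example_prefixes = {
    "solo_founder": "I am a solo founder looking for software.",
    "enterprise_vp": "I am a VP at a large enterprise.",
}
example_stats = prefix_stats(example_prefixes, "best CRM software")
```

In this toy case both prefixes have the same length, but only the first shares a token with the query, which is the kind of lexical-overlap imbalance the referee asks the authors to rule out.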

Circularity Check

1 step flagged

Self-citation load-bearing only for asymmetry explanation; core deltas independent

specific steps

self-citation, load-bearing [Abstract]
"the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026)"

The paper's interpretation of the larger Anthropic effect size is justified by citing retrieval attribution statistics from prior work by the lead author (Jack 2026). While the Jaccard deltas themselves are independent observations from the current runs, the explanatory claim for the provider asymmetry reduces to this self-cited result.

full rationale

The paper reports direct experimental measurements of Jaccard drops under persona prefixes across 2000 runs, with no equations, parameter fits, or derivations that reduce to inputs. The sole load-bearing self-citation appears in the abstract to explain the Anthropic/OpenAI point-estimate difference via retrieval attribution rates from Jack 2026. This does not affect the validity of the measured deltas or the prominence-stratified pattern, which rest on the current audit's data. This matches the pattern of limited self-citation in which the central empirical claim retains independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.1-grok · 5899 in / 1257 out tokens · 31962 ms · 2026-06-29T07:18:00.886301+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 16 canonical work pages · 2 internal anchors

[1] Jack, W. Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit. Unusual.ai Research Series, 2026a.
[2] Jack, W. Divergent Recommendations, Convergent Diagnoses: Cross-Provider Failure-Mode Convergence in AI Commercial Recommendation. Unusual.ai Research Series, 2026b.
[3] Jack, W. Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline. Unusual.ai Research Series, 2026c.
[4] Adomavicius, G., Tuzhilin, A. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[5] Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. KDD ’24, 2024. arXiv:2311.09735.
[6] Bai, X., Wang, A., Sucholutsky, I., Griffiths, T. L. Explicitly Unbiased Large Language Models Still Form Biased Associations. Proceedings of the National Academy of Sciences, 122(8), 2025.
[7] Chatterjee, A., Renduchintala, H., Bhatia, S., Chakraborty, T. POSIX: A Prompt Sensitivity Index for Large Language Models. EMNLP Findings, 2024. arXiv:2410.02185.
[8] Ge, T., Chan, X., Wang, X., Yu, D., Mi, H., Yu, D. Scaling Synthetic Data Creation with 1,000,000,000 Personas (PersonaHub). arXiv:2406.20094, 2024.
[9] Hu, T., Collier, N. Quantifying the Persona Effect in LLM Simulations. ACL, 2024.
[10] Kantharuban, A., Milbauer, J., Sap, M., Strubell, E., Neubig, G. Stereotype or Personalization? User Identity Biases Chatbot Recommendations. ACL Findings, 2025. arXiv:2410.05613.
[11] Lin, W., Gerchanovsky, A., Akgul, O., Bauer, L., Fredrikson, M., Wang, Z. LLM Whisperer: An Inconspicuous Attack to Bias LLM Responses. CHI, 2025. arXiv:2406.04755.
[12] Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., Stanovsky, G. State of What Art? A Call for Multi-Prompt LLM Evaluation. Transactions of the Association for Computational Linguistics (TACL), 2024.
[13] Li, L., Sleem, L., Gentile, N., Nichil, G., State, R. Exploring the Impact of Temperature on Large Language Models: Hot or Cold? arXiv:2506.07295, 2025.
[14] Hu, Z., Lian, J., Xiao, Z., Xiong, M., Lei, Y., Wang, T., Ding, K., Xiao, Z., Yuan, N. J., Xie, X. Population-Aligned Persona Generation for LLM-based Social Simulation. arXiv:2509.10127, 2025.
[15] Lutz, M., Sen, I., Ahnert, G., Rogers, E., Strohmaier, M. The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models. Findings of EMNLP, 2025. arXiv:2507.16076.
[16] Rienecker, J., Mpofu, K., Goel, N., Datta, S., Zhao, J., Danielsson, O., Thorsen, F. Auditing Preferences for Brands and Cultures in LLMs (ChoiceEval). arXiv:2603.18300, 2026.
[17] Lichtenberg, J. M., Buchholz, A., Schwöbel, P. Large Language Models as Recommender Systems: A Study of Popularity Bias. arXiv:2406.01285, 2024.
[18] Schaeffer, R., Miranda, B., Koyejo, S. Are Emergent Abilities of Large Language Models a Mirage? NeurIPS, 2023.
[19] Sclar, M., Choi, Y., Tsvetkov, Y., Suhr, A. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. ICLR, 2024. arXiv:2310.11324.
[20] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL, 2023.
[21] Wang, Y., Ren, R., Wang, Y., Zhao, W. X., Liu, J., Wu, H., Wang, H. Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation. SIGIR, 2025. arXiv:2505.11995.
[22] An, J., Huang, D., Lin, C., Tai, M. Measuring Gender and Racial Biases in Large Language Models: Intersectional Evidence from Automated Resume Evaluation. PNAS Nexus, 4(3), 2025. DOI: 10.1093/pnasnexus/pgaf089.
[23] Goyal, S., Baek, C., Kolter, J. Z., Raghunathan, A. Context-Parametric Inversion: Why Instruction Finetuning Can Worsen Context Reliance. ICLR, 2025. arXiv:2410.10796.
[24] Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., Xu, W. Knowledge Conflicts for LLMs: A Survey. EMNLP, 2024. arXiv:2403.08319.
[25] Xu, K., Potka, S., Thomo, A. Gender and Race Bias in Consumer Product Recommendations by Large Language Models. AINA 2025, Lecture Notes in Networks and Systems vol. 1210, Springer, 2025. arXiv:2602.08124.
[26] Zheng, M., Pei, J., Logeswaran, L., Lee, M., Jurgens, D. When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models. EMNLP Findings, 2024.
