Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

Keller Maloney; Noah Lehman; Sarah Xu; Will Jack

arxiv: 2605.27440 · v1 · pith:WKSTUGW5new · submitted 2026-05-22 · 💻 cs.IR · cs.AI

Pith reviewed 2026-06-30 14:41 UTC · model grok-4.3

keywords: paraphrase brittleness · retrieval-augmented recommendation · AI visibility tracking · prompt sensitivity · Jaccard similarity · brand recommendation · reproducibility

The pith

Prompt wording, not buyer intent, drives which brands AI assistants recommend.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that small rewordings of the same purchase question produce substantially different brand recommendations from AI systems. Across thousands of runs, similarity between paraphrases of one intent measures far lower by Jaccard index than similarity between repeated identical prompts. This gap persists even with added reasoning effort. As a result, counting brand mentions across a fixed prompt list captures phrasing artifacts more than stable model behavior toward any brand. The finding questions the stability of prompt-based visibility metrics used in commercial AI optimization.

Core claim

Small changes to how a buyer phrases a question produce substantially different brand recommendations from AI assistants. The recommendation-set similarity between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings and 0.135 for constraint-adding rewordings, both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap.

What carries the argument

Jaccard similarity of recommendation sets, measured between paraphrase pairs versus same-prompt rerun controls.
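As a minimal sketch of that comparison, the Jaccard index of two recommendation sets is |A ∩ B| / |A ∪ B|, averaged over pairs of runs. The function names and the brand sets below are hypothetical and invented for illustration; they are not the paper's data or code, and the paper may pair runs differently.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two brand sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty recommendation sets count as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Mean Jaccard over all unordered pairs of recommendation sets."""
    return mean(jaccard(x, y) for x, y in combinations(runs, 2))


# Hypothetical brand sets, invented for illustration only.
reruns = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "C"}]       # one prompt, rerun
paraphrases = [{"A", "B", "C"}, {"A", "E", "F"}, {"G", "B", "H"}]  # one intent, reworded

rerun_baseline = mean_pairwise_jaccard(reruns)
paraphrase_similarity = mean_pairwise_jaccard(paraphrases)
print(f"rerun baseline: {rerun_baseline:.3f}, paraphrase: {paraphrase_similarity:.3f}")
```

The quantity of interest is the gap between the two means: the rerun figure bounds how much agreement the model can show with itself, and the paraphrase figure is read against that ceiling.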

If this is right

Prompt-by-prompt mention tracking is structurally unstable as a unit of measurement.
Sampling more paraphrases per intent can reduce the artifact in principle.
The natural buyer-phrasing space exceeds the scale of current benchmark prompt sets used in evaluation methods.
Meaningful improvement requires a different unit of measurement rather than larger prompt sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sensitivity may appear in other retrieval-augmented tasks such as search result ordering.
Stabilizing outputs across equivalent intents could become a design target for recommendation models.
Commercial visibility trackers may need intent-level aggregation methods that have been validated beyond small prompt sets.
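One way to picture intent-level aggregation is to compute a brand's mention rate per prompt and then average those rates within each buyer intent. This is only a sketch of the idea, not a method from the paper: `mention_rate`, `intent_level_rates`, the prompt-to-intent mapping, and the brand sets are all hypothetical, and the prompts "best CRM" and "top CRM" are borrowed from the abstract's example.

```python
from collections import defaultdict
from statistics import mean


def mention_rate(runs, brand):
    """Fraction of runs whose recommendation set contains the brand."""
    return mean(1.0 if brand in r else 0.0 for r in runs)


def intent_level_rates(runs_by_prompt, intent_of, brand):
    """Average per-prompt mention rates within each intent.

    runs_by_prompt: dict mapping prompt -> list of recommendation sets
    intent_of:      dict mapping prompt -> intent label (hypothetical grouping)
    """
    per_intent = defaultdict(list)
    for prompt, runs in runs_by_prompt.items():
        per_intent[intent_of[prompt]].append(mention_rate(runs, brand))
    return {intent: mean(rates) for intent, rates in per_intent.items()}


# Hypothetical runs, invented for illustration only.
runs_by_prompt = {
    "best CRM": [{"A", "B"}, {"A", "C"}],
    "top CRM": [{"B", "C"}, {"C", "D"}],
}
intent_of = {"best CRM": "crm", "top CRM": "crm"}

per_prompt = {p: mention_rate(r, "A") for p, r in runs_by_prompt.items()}
per_intent = intent_level_rates(runs_by_prompt, intent_of, "A")
```

In this toy data brand "A" scores 1.0 on one phrasing and 0.0 on the other, so a tracker issuing a single prompt would report either extreme, while the intent-level figure is 0.5. Whether a handful of paraphrases is enough to stabilize such an average is exactly what the paper questions.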

Load-bearing premise

The same-prompt rerun baseline isolates model-intrinsic stability so that lower paraphrase similarity can be attributed specifically to linguistic variation.

What would settle it

Observing Jaccard similarities for paraphrases that fall inside or above the 0.50-0.61 same-prompt rerun range would show the claimed dominance of prompt string does not hold.
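The criterion can be written as a simple check against the figures reported in the abstract. Using the upper bound of each confidence interval as the decision rule is an assumption made here for illustration, not a test the paper specifies, and `reaches_rerun_range` is a hypothetical helper name.

```python
RERUN_LOW, RERUN_HIGH = 0.50, 0.61  # same-prompt rerun range reported in the abstract


def reaches_rerun_range(ci_high, rerun_low=RERUN_LOW):
    """True if the interval's upper bound reaches the rerun range or beyond."""
    return ci_high >= rerun_low


# (point estimate, CI lower, CI upper) as reported in the abstract.
reported = {
    "cosmetic": (0.288, 0.215, 0.361),
    "constraint-adding": (0.135, 0.098, 0.175),
}

verdicts = {name: reaches_rerun_range(hi) for name, (_, _, hi) in reported.items()}
print(verdicts)
```

Both reported upper bounds sit below 0.50, so neither paraphrase class reaches the rerun range under this rule; a replication whose intervals did reach it would count against the claim.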

Original abstract

Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) -- both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand's "AI visibility" by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model's behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper quantifies lower Jaccard overlap for paraphrased prompts than same-prompt reruns in commercial AI brand recommendations, but the abstract leaves the controls too vague to pin the gap on wording alone.

The main takeaway is that recommendation sets from different phrasings of the same buyer intent overlap only 0.135-0.288 by Jaccard, well below the 0.50-0.61 range for identical prompts rerun on the same models. The authors collected roughly 6,000 runs in each arm on OpenAI and Anthropic systems and report clustered confidence intervals.

What is new is the scale of the comparison in a production commercial setting and the explicit use of the rerun baseline to benchmark stability. The paper does a clean job of spelling out the practical consequence for AEO/GEO trackers that rely on fixed prompt sets.

The soft spots are exactly where the stress-test note flags them. The abstract supplies no information on paraphrase construction, temperature, top_p, model snapshot, timing, or any other session variables that might have differed between the two conditions. Without those controls documented, the gap cannot be cleanly attributed to linguistic variation rather than protocol differences. The post-hoc pooling across axes is also presented without justification. The claim that extra reasoning effort does not close the gap inherits the same ambiguity.

This work is aimed at people who build or evaluate visibility metrics for generative recommendation systems. A reader who cares about prompt sensitivity or commercial RAG would get a useful data point from it.

It deserves a serious referee to examine the methods section and see whether the controls actually hold up.

Referee Report

3 major / 1 minor

Summary. The paper claims that small phrasing changes in buyer queries (cosmetic rewordings or constraint-adding rewordings) yield recommendation sets whose Jaccard overlap is only 0.135–0.288 (with 95% CIs), far below the 0.50–0.61 obtained from same-prompt reruns, across ~6,000 runs in each arm on OpenAI and Anthropic models. It concludes that the literal prompt string dominates over underlying buyer intent, rendering prompt-by-prompt mention counting structurally unstable as a visibility metric.

Significance. If the comparison is robust, the result would directly undermine current AEO/GEO evaluation practices that rely on fixed prompt sets and would motivate alternative units of measurement. The reported scale (~12,000 total runs), specific Jaccard values, and confidence intervals make this an empirical contribution that can be checked against the rerun baseline.

major comments (3)

[Abstract] The claim that the Jaccard gap is attributable to paraphrase differences rather than other factors rests on the assumption that the same-prompt rerun baseline holds all non-linguistic variables fixed, yet the abstract supplies no information on whether temperature, top_p, model snapshot, request metadata, timing, or session state were identical across conditions.
[Abstract] Post-hoc pooling across region/language and specificity-ladder axes is performed without reported justification, separate per-axis statistics, or a pre-specified analysis plan, so the pooled intervals [0.215, 0.361] and [0.098, 0.175] cannot be interpreted as direct evidence for the central claim.
[Abstract] The statement that 'increasing reasoning effort does not narrow the gap (bounded by +/-0.05)' inherits the same control ambiguity and additionally lacks any description of how reasoning effort was operationalized or varied.

minor comments (1)

[Abstract] The abstract does not describe the paraphrase-generation procedure, exclusion rules, or exact model versions used, which would be needed for reproducibility even if the control issue is resolved.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below with clarifications from the full manuscript and indicate planned revisions.

Referee: [Abstract] The claim that the Jaccard gap is attributable to paraphrase differences rather than other factors rests on the assumption that the same-prompt rerun baseline holds all non-linguistic variables fixed, yet the abstract supplies no information on whether temperature, top_p, model snapshot, request metadata, timing, or session state were identical across conditions.

Authors: All non-linguistic variables were held fixed across conditions: temperature=0, top_p=1.0, identical model snapshots, sequential requests with no session state or metadata variation, and timing within the same batch. These controls are described in Section 3.2. We will revise the abstract to state explicitly that the rerun baseline holds these variables constant. Revision: yes.

Referee: [Abstract] Post-hoc pooling across region/language and specificity-ladder axes is performed without reported justification, separate per-axis statistics, or a pre-specified analysis plan, so the pooled intervals [0.215, 0.361] and [0.098, 0.175] cannot be interpreted as direct evidence for the central claim.

Authors: The pooling summarizes the dominant pattern across axes; we agree separate per-axis statistics improve transparency. A revision will add these in an expanded Table 2 along with a methods justification for the pooled estimate. As the study was exploratory and not pre-registered, we will note the post-hoc nature as a limitation. Revision: partial.

Referee: [Abstract] The statement that 'increasing reasoning effort does not narrow the gap (bounded by +/-0.05)' inherits the same control ambiguity and additionally lacks any description of how reasoning effort was operationalized or varied.

Authors: Reasoning effort was operationalized via chain-of-thought prompting variants and higher-reasoning model configurations (detailed in Section 4.3); the observed gap remained bounded by +/-0.05 under these conditions with the same controls as the main experiments. We will revise the abstract to include a brief description of the operationalization. Revision: yes.

Circularity Check

0 steps flagged

No circularity; purely empirical comparison to external rerun baseline

The paper reports measured Jaccard overlaps from ~12,000 experimental runs (paraphrase vs. same-prompt rerun controls) on commercial models. No equations, fitted parameters, or derivations appear in the provided text. The central comparison (0.135–0.288 vs. 0.50–0.61) is a direct empirical difference against an independently executed rerun baseline; it does not reduce to any self-referential quantity or self-citation chain. Self-citation load-bearing, ansatz smuggling, and renaming-known-result patterns are absent. The study is self-contained against its stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical data collection and standard statistical assumptions rather than new axioms or parameters.

axioms (2)

domain assumption: Statistical independence of model runs for CI calculation. Invoked implicitly for the reported 95% CIs on Jaccard values.
domain assumption: Jaccard similarity is a suitable measure of recommendation-set stability. Used as the primary comparison metric without further justification in the abstract.
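The abstract describes its intervals as clustered, but the clustering unit and procedure are not given on this page. One common approach that relaxes run-level independence is a cluster bootstrap, sketched below under the assumption that scores are grouped by buyer intent. The function name, the grouping, and the scores are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(scores_by_cluster, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean score, resampling whole clusters.

    scores_by_cluster: dict mapping cluster id -> list of pairwise Jaccard scores
    """
    rng = random.Random(seed)
    clusters = list(scores_by_cluster.values())
    estimates = []
    for _ in range(n_boot):
        # Resample clusters (not individual runs) with replacement.
        sample = rng.choices(clusters, k=len(clusters))
        estimates.append(mean(s for cluster in sample for s in cluster))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical Jaccard scores grouped by intent, invented for illustration only.
scores = {
    "intent-1": [0.2, 0.3],
    "intent-2": [0.1, 0.4],
    "intent-3": [0.3, 0.3],
    "intent-4": [0.25, 0.35],
}
point = mean(s for cluster in scores.values() for s in cluster)
ci_low, ci_high = cluster_bootstrap_ci(scores)
```

Resampling whole clusters keeps correlated runs together, so the interval reflects variation between intents rather than treating every pair of runs as an independent draw.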

Reference graph

Works this paper leans on

16 extracted references · 10 canonical work pages · 3 internal anchors

[1] Jack, W. Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit. Unusual.ai Research Series, 2026a.
[2] Jack, W. Divergent Recommendations, Convergent Diagnoses: Cross-Provider Failure-Mode Convergence in AI Commercial Recommendation. Unusual.ai Research Series, 2026b.
[3] Jack, W. Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit. Unusual.ai Research Series, 2026c.
[4] Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. KDD ’24, 2024. arXiv:2311.09735.
[5] Atil, B., Chittams, A., Fu, L., Ture, F., Xu, L., Baldwin, B. LLM Stability: A Detailed Analysis with Some Surprises. arXiv:2408.04667, 2024.
[6] Gan, W. C., Ng, H. T. Improving the Robustness of Question Answering Systems to Question Paraphrasing. ACL, 2019. ACL Anthology P19-1610.
[7] Chatterjee, A., Renduchintala, H. S. V. N. S. K., Bhatia, S., Chakraborty, T. POSIX: A Prompt Sensitivity Index For Large Language Models. EMNLP Findings, 2024. arXiv:2410.02185.
[8] Perçin, S., Su, X., Syed, Q. S., Howard, P., Kuvshinov, A., Schwinn, L., Scholl, K.-U. Investigating the Robustness of Retrieval-Augmented Generation at the Query Level. ACL Workshop on Generation, Evaluation & Metrics, 2025. arXiv:2507.06956.
[9] Guo, D., Yang, D., Zhang, H., et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025.
[10] Yuan, J., Li, H., Ding, X., Xie, W., Li, Y.-J., Zhao, W., Wan, K., Shi, J., Hu, X., Liu, Z. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. arXiv:2506.09501, 2025.
[11] Ma, X., Gong, Y., He, P., Zhao, H., Duan, N. Query Rewriting for Retrieval-Augmented Large Language Models. EMNLP, 2023. arXiv:2305.14283.
[12] Meincke, L., Mollick, E., Mollick, L., Shapiro, D. Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting. arXiv:2506.07142, 2025.
[13] Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., Stanovsky, G. State of What Art? A Call for Multi-Prompt LLM Evaluation. TACL, 2024.
[14] Polo, F. M., Xu, R., Weber, L., Silva, M., Bhardwaj, O., Choshen, L., de Oliveira, A. F. M., Sun, Y., Yurochkin, M. Efficient Multi-Prompt Evaluation of LLMs. NeurIPS, 2024.
[15] Sclar, M., Choi, Y., Tsvetkov, Y., Suhr, A. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting. ICLR, 2024. arXiv:2310.11324.
[16] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR, 2023. arXiv:2203.11171.
