Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline
Pith reviewed 2026-06-30 14:41 UTC · model grok-4.3
The pith
Prompt wording, not buyer intent, drives which brands AI assistants recommend.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Small changes to how a buyer phrases a question produce substantially different brand recommendations from AI assistants. The recommendation-set similarity between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings and 0.135 for constraint-adding rewordings, both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap.
What carries the argument
Jaccard similarity of recommendation sets, measured between paraphrase pairs versus same-prompt rerun controls.
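As a minimal sketch of the metric (not the authors' code; the brand names and helper functions below are hypothetical), Jaccard similarity is the size of the intersection of two recommendation sets divided by the size of their union. The paraphrase condition and the rerun control can then be compared as mean pairwise scores:

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two brand sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(sets):
    """Average Jaccard over every unordered pair of recommendation sets."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two recommendation sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs for one buying intent (brand names are made up).
paraphrase_runs = [
    {"AlphaCRM", "BetaCRM", "GammaCRM"},     # prompt: "best CRM"
    {"AlphaCRM", "DeltaCRM", "EpsilonCRM"},  # prompt: "top CRM"
]
rerun_controls = [
    {"AlphaCRM", "BetaCRM", "GammaCRM"},     # prompt: "best CRM", run 1
    {"AlphaCRM", "BetaCRM", "DeltaCRM"},     # prompt: "best CRM", run 2
]

paraphrase_score = mean_pairwise_jaccard(paraphrase_runs)  # 1 shared of 5 brands
rerun_score = mean_pairwise_jaccard(rerun_controls)        # 2 shared of 4 brands
```

In this toy data the paraphrase pair shares one brand out of five and the rerun pair shares two out of four, which mirrors the direction of the gap the paper reports but not its measured values.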
If this is right
- Prompt-by-prompt mention tracking is structurally unstable as a unit of measurement.
- Sampling more paraphrases per intent can reduce the artifact in principle.
- The natural buyer-phrasing space is much larger than the benchmark-scale prompt sets on which multi-prompt evaluation methods have been validated.
- Meaningful improvement requires a different unit of measurement rather than larger prompt sets.
Where Pith is reading between the lines
- The same sensitivity may appear in other retrieval-augmented tasks such as search result ordering.
- Stabilizing outputs across equivalent intents could become a design target for recommendation models.
- Commercial visibility trackers may need intent-level aggregation methods that have been validated beyond small prompt sets.
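One way to picture intent-level aggregation is sketched below. This is an illustration only, not a method from the paper and not a validated one: the function name, the tuple layout, and the brand names are all assumptions. Mentions are averaged within each intent first, then across intents, so that an intent with many paraphrases does not dominate the overall figure.

```python
from collections import defaultdict


def intent_level_mention_rate(runs, brand):
    """Mention rate of `brand`, averaged within each intent, then across intents.

    `runs` is an iterable of (intent_id, prompt_text, brand_set) tuples.
    Returns (per_intent_rates, overall_rate).
    """
    hits_by_intent = defaultdict(list)
    for intent_id, _prompt_text, brands in runs:
        hits_by_intent[intent_id].append(1.0 if brand in brands else 0.0)
    if not hits_by_intent:
        raise ValueError("no runs supplied")
    per_intent = {i: sum(h) / len(h) for i, h in hits_by_intent.items()}
    overall = sum(per_intent.values()) / len(per_intent)
    return per_intent, overall


# Hypothetical runs: three paraphrases of one intent, one prompt for another.
runs = [
    ("crm", "best CRM", {"AlphaCRM", "BetaCRM"}),
    ("crm", "top CRM", {"BetaCRM"}),
    ("crm", "best CRM for a SaaS startup", {"GammaCRM"}),
    ("helpdesk", "best helpdesk tool", {"AlphaCRM"}),
]
per_intent, overall = intent_level_mention_rate(runs, "AlphaCRM")
```

Here the brand appears in one of three "crm" paraphrases and in the single "helpdesk" run, so a tracker that issued only "best CRM" would have reported full visibility for an intent where the paraphrase-averaged rate is one third.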
Load-bearing premise
The same-prompt rerun baseline isolates model-intrinsic stability so that lower paraphrase similarity can be attributed specifically to linguistic variation.
What would settle it
Observing paraphrase Jaccard similarities that fall inside or above the 0.50-0.61 same-prompt rerun range would show that the claimed dominance of the prompt string does not hold.
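The check can be written out against the figures quoted in the abstract. The sketch below is only a restatement of that comparison; the function and variable names are hypothetical, and treating the upper CI bound as the decision threshold is an assumption made here, not a rule stated in the paper.

```python
RERUN_BASELINE = (0.50, 0.61)  # same-prompt rerun range from the abstract

# Point estimates and clustered 95% CIs as reported in the abstract.
PARAPHRASE_RESULTS = {
    "cosmetic": {"mean": 0.288, "ci": (0.215, 0.361)},
    "constraint_adding": {"mean": 0.135, "ci": (0.098, 0.175)},
}


def reaches_baseline(ci, baseline=RERUN_BASELINE):
    """True if the CI's upper bound reaches the low end of the rerun range."""
    return ci[1] >= baseline[0]


verdicts = {
    name: reaches_baseline(result["ci"])
    for name, result in PARAPHRASE_RESULTS.items()
}
```

With the reported intervals, neither paraphrase condition reaches the rerun range: the upper bounds are 0.361 and 0.175, both below 0.50.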
Original abstract
Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) -- both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand's "AI visibility" by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model's behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that small phrasing changes in buyer queries (cosmetic or constraint-adding rewordings) yield recommendation sets whose Jaccard overlap is only 0.135–0.288 (with 95% CIs), far below the 0.50–0.61 obtained from same-prompt reruns, across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models. It concludes that the literal prompt string dominates over underlying buyer intent, rendering prompt-by-prompt mention counting structurally unstable as a visibility metric.
Significance. If the comparison is robust, the result would directly undermine current AEO/GEO evaluation practices that rely on fixed prompt sets and would motivate alternative units of measurement. The reported scale (~12,000 total runs), the specific Jaccard values, and the confidence intervals constitute a concrete empirical contribution that can be checked against the rerun baseline.
major comments (3)
- [Abstract] The claim that the Jaccard gap is attributable to paraphrase differences rather than other factors rests on the assumption that the same-prompt rerun baseline holds all non-linguistic variables fixed, yet the abstract supplies no information on whether temperature, top_p, model snapshot, request metadata, timing, or session state were identical across conditions.
- [Abstract] Post-hoc pooling across region/language and specificity-ladder axes is performed without reported justification, separate per-axis statistics, or a pre-specified analysis plan, so the pooled intervals [0.215, 0.361] and [0.098, 0.175] cannot be interpreted as direct evidence for the central claim.
- [Abstract] The statement that 'increasing reasoning effort does not narrow the gap (bounded by +/-0.05)' inherits the same control ambiguity and additionally lacks any description of how reasoning effort was operationalized or varied.
minor comments (1)
- [Abstract] The abstract does not describe the paraphrase-generation procedure, exclusion rules, or exact model versions used, which would be needed for reproducibility even if the control issue is resolved.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below with clarifications from the full manuscript and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract] The claim that the Jaccard gap is attributable to paraphrase differences rather than other factors rests on the assumption that the same-prompt rerun baseline holds all non-linguistic variables fixed, yet the abstract supplies no information on whether temperature, top_p, model snapshot, request metadata, timing, or session state were identical across conditions.
  Authors: All non-linguistic variables were held fixed across conditions: temperature=0, top_p=1.0, identical model snapshots, sequential requests with no session state or metadata variation, and timing within the same batch. These controls are described in Section 3.2. We will revise the abstract to state explicitly that the rerun baseline holds these variables constant.
  Revision: yes
- Referee: [Abstract] Post-hoc pooling across region/language and specificity-ladder axes is performed without reported justification, separate per-axis statistics, or a pre-specified analysis plan, so the pooled intervals [0.215, 0.361] and [0.098, 0.175] cannot be interpreted as direct evidence for the central claim.
  Authors: The pooling summarizes the dominant pattern across axes; we agree separate per-axis statistics improve transparency. A revision will add these in an expanded Table 2 along with a methods justification for the pooled estimate. As the study was exploratory and not pre-registered, we will note the post-hoc nature as a limitation.
  Revision: partial
- Referee: [Abstract] The statement that 'increasing reasoning effort does not narrow the gap (bounded by +/-0.05)' inherits the same control ambiguity and additionally lacks any description of how reasoning effort was operationalized or varied.
  Authors: Reasoning effort was operationalized via chain-of-thought prompting variants and higher-reasoning model configurations (detailed in Section 4.3); the observed gap remained bounded by +/-0.05 under these conditions with the same controls as the main experiments. We will revise the abstract to include a brief description of the operationalization.
  Revision: yes
Circularity Check
No circularity; purely empirical comparison to external rerun baseline
The paper reports measured Jaccard overlaps from ~12,000 experimental runs (paraphrase runs plus same-prompt rerun controls) on commercial models. No equations, fitted parameters, or derivations appear in the provided text. The central comparison (0.135–0.288 vs. 0.50–0.61) is a direct empirical difference against an independently executed rerun baseline; it does not reduce to any self-referential quantity or self-citation chain. None of the usual circularity patterns (load-bearing self-citation, ansatz smuggling, renaming a known result) is present. The study is self-contained against its stated external benchmark.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: statistical independence of model runs for CI calculation
- Domain assumption: Jaccard similarity is a suitable measure of recommendation-set stability
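The abstract reports clustered 95% CIs, which relax run-level independence to independence between clusters. One common way to obtain such an interval is a cluster bootstrap, sketched below. This is an assumption about the general technique, not the paper's stated procedure: the choice of buying intent as the cluster, the function name, and the scores are all hypothetical.

```python
import random


def cluster_bootstrap_ci(scores_by_cluster, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score, resampling whole clusters.

    `scores_by_cluster` maps a cluster id (assumed here: one buying intent)
    to the list of pairwise Jaccard scores observed within that cluster.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    clusters = list(scores_by_cluster.values())
    if not clusters or any(len(c) == 0 for c in clusters):
        raise ValueError("need at least one cluster and no empty clusters")
    means = []
    for _ in range(n_boot):
        # Draw clusters with replacement, keeping each cluster's scores together.
        sample = [rng.choice(clusters) for _ in clusters]
        flat = [s for cluster in sample for s in cluster]
        means.append(sum(flat) / len(flat))
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical pairwise Jaccard scores, grouped by intent.
scores = {
    "crm": [0.20, 0.25, 0.30],
    "helpdesk": [0.10, 0.15],
    "payroll": [0.40, 0.35, 0.30],
}
all_scores = [s for cluster in scores.values() for s in cluster]
point_estimate = sum(all_scores) / len(all_scores)
lower, upper = cluster_bootstrap_ci(scores)
```

Resampling whole clusters rather than individual runs keeps correlated scores from the same intent together, which generally widens the interval relative to treating every run as independent.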