
arxiv: 2605.27440 · v1 · pith:WKSTUGW5new · submitted 2026-05-22 · 💻 cs.IR · cs.AI

Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

Pith reviewed 2026-06-30 14:41 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords paraphrase brittleness · retrieval-augmented recommendation · AI visibility tracking · prompt sensitivity · Jaccard similarity · brand recommendation · reproducibility

The pith

Prompt wording, not buyer intent, drives which brands AI assistants recommend.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that small rewordings of the same purchase question produce substantially different brand recommendations from AI systems. Across thousands of runs, the Jaccard similarity between recommendation sets from paraphrases of one intent is far lower than between repeated runs of an identical prompt. The gap persists even with added reasoning effort. As a result, counting brand mentions across a fixed prompt list captures phrasing artifacts more than stable model behavior toward any brand. The finding calls into question the stability of prompt-based visibility metrics used in commercial AI optimization.

Core claim

Small changes to how a buyer phrases a question produce substantially different brand recommendations from AI assistants. The recommendation-set similarity between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings and 0.135 for constraint-adding rewordings, both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap.

What carries the argument

Jaccard similarity of recommendation sets, measured between paraphrase pairs versus same-prompt rerun controls.
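A minimal sketch of this comparison, assuming each run has already been reduced to a set of recommended brand names. The helper names (`jaccard`, `mean_pairwise_jaccard`) and the placeholder brands are illustrative and not taken from the paper, whose exact pairing and averaging scheme may differ.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two brand sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard over every unordered pair of runs in one group."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to form a pair")
    return mean(jaccard(x, y) for x, y in pairs)


# Toy data (hypothetical): two reruns of one prompt, two paraphrases of one intent.
reruns = [{"BrandA", "BrandB", "BrandC"}, {"BrandA", "BrandB", "BrandD"}]
paraphrases = [{"BrandA", "BrandB", "BrandC"}, {"BrandA", "BrandE", "BrandF", "BrandG"}]

rerun_score = mean_pairwise_jaccard(reruns)            # 2 shared / 4 distinct
paraphrase_score = mean_pairwise_jaccard(paraphrases)  # 1 shared / 6 distinct
print(rerun_score, paraphrase_score)
```

In this toy case the rerun pair scores 0.5 and the paraphrase pair about 0.17. The paper's argument rests on the second number sitting well below the first when both are measured under the same conditions.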

If this is right

  • Prompt-by-prompt mention tracking is structurally unstable as a unit of measurement.
  • Sampling more paraphrases per intent can reduce the artifact in principle.
  • The natural buyer-phrasing space exceeds the scale of current benchmark prompt sets used in evaluation methods.
  • Meaningful improvement requires a different unit of measurement rather than larger prompt sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sensitivity may appear in other retrieval-augmented tasks such as search result ordering.
  • Stabilizing outputs across equivalent intents could become a design target for recommendation models.
  • Commercial visibility trackers may need intent-level aggregation methods that have been validated beyond small prompt sets.
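One way to picture intent-level aggregation is to compute each brand's mention rate within every paraphrase and then average those rates across the paraphrases of a single intent. The sketch below is a hypothetical illustration, not a method from the paper or a validated metric; the function name, the data layout, and the brand names are all assumptions.

```python
from collections import Counter


def intent_level_mention_rate(runs_by_paraphrase):
    """Average per-paraphrase brand mention rates across one intent.

    runs_by_paraphrase maps each paraphrase string to a list of
    recommendation sets, one set per run of that paraphrase.
    """
    per_paraphrase = []
    for runs in runs_by_paraphrase.values():
        if not runs:
            continue  # skip paraphrases that have no runs
        counts = Counter(brand for run in runs for brand in set(run))
        per_paraphrase.append({b: c / len(runs) for b, c in counts.items()})
    if not per_paraphrase:
        return {}
    brands = set().union(*per_paraphrase)
    # Each paraphrase gets equal weight, regardless of how many runs it has.
    return {
        b: sum(rates.get(b, 0.0) for rates in per_paraphrase) / len(per_paraphrase)
        for b in brands
    }


# Toy data (hypothetical): one intent, two paraphrases, two runs each.
runs = {
    "best CRM": [{"BrandA", "BrandB"}, {"BrandA", "BrandC"}],
    "top CRM": [{"BrandA", "BrandD"}, {"BrandD", "BrandE"}],
}
rates = intent_level_mention_rate(runs)
print(sorted(rates.items()))
```

Weighting paraphrases equally is a design choice made here for simplicity. A tracker with an estimate of how often buyers use each phrasing could weight by that frequency instead.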

Load-bearing premise

The same-prompt rerun baseline isolates model-intrinsic stability so that lower paraphrase similarity can be attributed specifically to linguistic variation.

What would settle it

Observing paraphrase Jaccard similarities that fall inside or above the 0.50-0.61 same-prompt rerun range would show that the claimed dominance of the prompt string does not hold.

Original abstract

Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) -- both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand's "AI visibility" by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model's behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.
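The abstract reports "clustered" 95% CIs but does not state the procedure. One common way to obtain such an interval is a cluster bootstrap that resamples whole intents, so that scores from the same intent stay together. The sketch below is an assumed stand-in, not the authors' confirmed method; the function name, the intent labels, and the scores are made up for illustration.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(scores_by_intent, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean score, resampling intents with replacement."""
    rng = random.Random(seed)
    intents = list(scores_by_intent)
    if not intents:
        raise ValueError("need at least one intent")
    point = mean(s for scores in scores_by_intent.values() for s in scores)

    estimates = []
    for _ in range(n_boot):
        # Draw as many intents as we have, keeping each intent's scores together.
        sample = [rng.choice(intents) for _ in intents]
        pooled = [s for intent in sample for s in scores_by_intent[intent]]
        estimates.append(mean(pooled))

    estimates.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return point, estimates[lo_idx], estimates[hi_idx]


# Toy data (hypothetical): pairwise Jaccard scores grouped by buying intent.
scores_by_intent = {
    "crm": [0.10, 0.20],
    "helpdesk": [0.30, 0.40],
    "email": [0.50, 0.60],
}
point, lo, hi = cluster_bootstrap_ci(scores_by_intent)
print(f"mean={point:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Resampling at the intent level tends to give wider intervals than treating every pair as independent, because pairs drawn from the same intent usually share brands and are correlated.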

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that small phrasing changes in buyer queries (cosmetic rewordings or added constraints) yield recommendation sets whose Jaccard overlap is only 0.135–0.288 (with 95% CIs), far below the 0.50–0.61 obtained from same-prompt reruns, across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models. It concludes that the literal prompt string dominates over underlying buyer intent, rendering prompt-by-prompt mention counting structurally unstable as a visibility metric.

Significance. If the comparison is robust, the result would directly undermine current AEO/GEO evaluation practices that rely on fixed prompt sets and would motivate alternative units of measurement. The reported scale (~12,000 total runs), specific Jaccard values, and confidence intervals constitute a concrete empirical contribution that can be checked against the rerun baseline.

major comments (3)
  1. [Abstract] The claim that the Jaccard gap is attributable to paraphrase differences rather than other factors rests on the assumption that the same-prompt rerun baseline holds all non-linguistic variables fixed, yet the abstract supplies no information on whether temperature, top_p, model snapshot, request metadata, timing, or session state were identical across conditions.
  2. [Abstract] Post-hoc pooling across region/language and specificity-ladder axes is performed without reported justification, separate per-axis statistics, or a pre-specified analysis plan, so the pooled intervals [0.215, 0.361] and [0.098, 0.175] cannot be interpreted as direct evidence for the central claim.
  3. [Abstract] The statement that 'increasing reasoning effort does not narrow the gap (bounded by +/-0.05)' inherits the same control ambiguity and additionally lacks any description of how reasoning effort was operationalized or varied.
minor comments (1)
  1. [Abstract] The abstract does not describe the paraphrase-generation procedure, exclusion rules, or exact model versions used, which would be needed for reproducibility even if the control issue is resolved.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below with clarifications from the full manuscript and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The claim that the Jaccard gap is attributable to paraphrase differences rather than other factors rests on the assumption that the same-prompt rerun baseline holds all non-linguistic variables fixed, yet the abstract supplies no information on whether temperature, top_p, model snapshot, request metadata, timing, or session state were identical across conditions.

    Authors: All non-linguistic variables were held fixed across conditions: temperature=0, top_p=1.0, identical model snapshots, sequential requests with no session state or metadata variation, and timing within the same batch. These controls are described in Section 3.2. We will revise the abstract to state explicitly that the rerun baseline holds these variables constant. revision: yes

  2. Referee: [Abstract] Post-hoc pooling across region/language and specificity-ladder axes is performed without reported justification, separate per-axis statistics, or a pre-specified analysis plan, so the pooled intervals [0.215, 0.361] and [0.098, 0.175] cannot be interpreted as direct evidence for the central claim.

    Authors: The pooling summarizes the dominant pattern across axes; we agree separate per-axis statistics improve transparency. A revision will add these in an expanded Table 2 along with a methods justification for the pooled estimate. As the study was exploratory and not pre-registered, we will note the post-hoc nature as a limitation. revision: partial

  3. Referee: [Abstract] The statement that 'increasing reasoning effort does not narrow the gap (bounded by +/-0.05)' inherits the same control ambiguity and additionally lacks any description of how reasoning effort was operationalized or varied.

    Authors: Reasoning effort was operationalized via chain-of-thought prompting variants and higher-reasoning model configurations (detailed in Section 4.3); the observed gap remained bounded by +/-0.05 under these conditions with the same controls as the main experiments. We will revise the abstract to include a brief description of the operationalization. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical comparison to external rerun baseline

Full rationale

The paper reports measured Jaccard overlaps from ~12,000 experimental runs (paraphrase vs. same-prompt rerun controls) on commercial models. No equations, fitted parameters, or derivations appear in the provided text. The central comparison (0.135–0.288 vs. 0.50–0.61) is a direct empirical difference against an independently executed rerun baseline; it does not reduce to any self-referential quantity or self-citation chain. Self-citation load-bearing, ansatz smuggling, and renaming-known-result patterns are absent. The study is self-contained against its stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical data collection and standard statistical assumptions rather than new axioms or parameters.

axioms (2)
  • domain assumption: Statistical independence of model runs for CI calculation
    Invoked implicitly for the reported 95% CIs on Jaccard values.
  • domain assumption: Jaccard similarity is a suitable measure of recommendation-set stability
    Used as the primary comparison metric without further justification in the abstract.

pith-pipeline@v0.9.1-grok · 5889 in / 1447 out tokens · 65430 ms · 2026-06-30T14:41:25.678109+00:00 · methodology

