Divergent Recommendations, Convergent Diagnoses: Cross-Provider Failure-Mode Convergence in AI Commercial Recommendation

Keller Maloney; Noah Lehman; Sarah Xu; Will Jack

arxiv: 2606.26116 · v1 · pith:DLYSSMWLnew · submitted 2026-05-22 · 💻 cs.CY · cs.AI

Pith reviewed 2026-06-30 14:40 UTC · model grok-4.3

keywords: AI recommendations · failure modes · cross-provider agreement · brand visibility · long-tail brands · ChatGPT · Claude · commercial prompts

The pith

ChatGPT and Claude disagree on which brands to recommend about two-thirds of the time, but when both omit a brand they agree on the failure mode 95.1 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether two major AI providers can share one optimization strategy for commercial recommendations or require separate ones. It measures recommendation overlap across hundreds of prompts and finds low agreement on which brands appear. When both providers omit the same brand, however, researchers classify the omission into one of three failure modes and observe near-identical classifications in over 95 percent of cases. The match strengthens as brand prominence falls, reaching 99.6 percent for long-tail regional brands. This pattern implies that fixes aimed at a shared failure mode can raise visibility for both systems at once, at least outside the top brands.

Core claim

Across 215 commercially framed prompts run in four batches, the two providers produce overlapping brand lists only about one-third of the time. On the 7,763 occasions when neither recommends a given brand, independent classification into discoverability, compellingness, or positioning yields the same label 95.1 percent of the time. Agreement rises monotonically from 81 percent on category leaders to 99.6 percent on long-tail brands. The providers reach recommendations through measurably different generative routes yet converge on the same diagnostic label when a brand is missed.

What carries the argument

Three failure-mode categories—discoverability (brand never reaches the model), compellingness (brand reaches the model but is not mentioned), and positioning (brand is mentioned but not recommended)—applied to every joint non-recommendation.
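
The three categories above can be read as an ordered decision rule. The sketch below is only an illustration of that reading: the abstract reports no formal decision rules, so the function name `classify_failure` and the two boolean inputs `reached_model` and `mentioned` are hypothetical, and how the authors actually determine either flag is not stated.

```python
from enum import Enum


class FailureMode(Enum):
    DISCOVERABILITY = "discoverability"  # brand never reaches the model
    COMPELLINGNESS = "compellingness"    # brand reaches the model but is not mentioned
    POSITIONING = "positioning"          # brand is mentioned but not recommended


def classify_failure(reached_model: bool, mentioned: bool) -> FailureMode:
    """Label one non-recommendation; both flags are hypothetical inputs."""
    if not reached_model:
        return FailureMode.DISCOVERABILITY
    if not mentioned:
        return FailureMode.COMPELLINGNESS
    # Reached the model and was mentioned, yet was not recommended.
    return FailureMode.POSITIONING


label = classify_failure(reached_model=True, mentioned=False)
```

Checking the flags in this order makes the three labels mutually exclusive and exhaustive by construction, which is the property the taxonomy is assumed to have.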

If this is right

Fixes that target a diagnosed failure mode raise brand visibility on both providers simultaneously.
A single optimization playbook suffices for long-tail regional brands.
Category-leader brands require provider-specific work on positioning and content.
The convergence on failure diagnosis occurs even though the providers generate recommendations from different internal routes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared diagnostic categories may reflect common patterns in how large language models encode commercial knowledge.
Firms selling into the long tail could prioritize one set of content changes rather than maintaining separate roadmaps.
Future work could test whether the same three categories apply when more than two providers are compared.

Load-bearing premise

The three failure-mode labels can be assigned to each omitted brand in a way that does not depend on the individual researcher or prompt wording.

What would settle it

Independent coders classifying the same set of 7,763 joint failures and obtaining agreement below 85 percent would indicate that the reported 95.1 percent convergence rests on subjective labeling.

Original abstract

A brand whose customers use both ChatGPT and Claude for product recommendations faces a strategic choice: a single optimization playbook, or one per provider? Across 215 commercially-framed prompts in four measurement batches, the two providers disagree on which brands they recommend roughly two-thirds of the time (cross-provider recommendation Jaccard 0.35, below the 0.50-0.61 same-prompt rerun baseline). The picks diverge. But when neither provider recommends a brand, we classify the failure into one of three modes -- discoverability (the brand never reaches the model), compellingness (it reaches the model but isn't mentioned), or positioning (it's mentioned but not recommended) -- and on 7,763 such joint failures, both providers diagnose the same failure mode 95.1% of the time (clustered 95% CI [94.3%, 95.7%]). Agreement rises monotonically with falling brand prominence, from 81% [78.2%, 84.0%] on category leaders to 99.6% [99.3%, 99.9%] on long-tail regional brands. The two providers reach their picks by measurably different generative routes -- Anthropic recommends from priors 43-52% of the time, OpenAI 8-29% -- but they converge on the failure diagnosis where it matters most for the long tail. Work that addresses the diagnosed failure mode lifts visibility on both providers; positioning- and content-level work for category leaders is more provider-specific.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports that ChatGPT and Claude diverge on brand recommendations but assign the same failure mode to about 95% of their joint misses, with agreement highest on long-tail brands. The mode labels, however, come with no reported validation.

The central observation is that the two providers disagree on roughly two-thirds of their brand picks but converge on the same failure diagnosis for the 7,763 joint misses at 95.1 percent, rising to 99.6 percent for long-tail regional brands.

The work runs 215 commercial prompts in batches, measures recommendation overlap with a Jaccard index of 0.35 against a same-prompt rerun baseline of 0.50-0.61, and bins the misses into discoverability, compellingness, or positioning. It also notes that the providers reach their outputs through different routes, with Anthropic relying on priors more often. These counts and the monotonic trend with brand prominence are the concrete new pieces.
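
For readers unfamiliar with the overlap measure, the Jaccard index of two recommendation sets is the size of their intersection divided by the size of their union. The sketch below shows the calculation on made-up brand names; the abstract does not say how per-prompt values are aggregated into the reported 0.35, and treating two empty lists as identical is a convention assumed here.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 1.0  # assumed convention for the empty-vs-empty case
    return len(a & b) / len(union)


# Hypothetical recommendation lists for a single prompt (not data from the paper).
chatgpt_picks = {"BrandA", "BrandB", "BrandC"}
claude_picks = {"BrandB", "BrandC", "BrandD", "BrandE"}

overlap = jaccard(chatgpt_picks, claude_picks)  # 2 shared brands out of 5 distinct -> 0.4
```

An index of 0.35 therefore means that, of all brands either provider recommends, roughly a third are recommended by both.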

The classification step is the clear soft spot. The abstract supplies no decision rules, edge cases, blinding protocol, or inter-rater numbers for assigning the 7,763 failures to the three modes. The stress-test concern is accurate on the evidence given: once the authors apply the labels, the reported agreement follows directly from those assignments rather than from an independent property of the models. Without that protocol, the 95.1 percent figure cannot be taken at face value.

This is for teams that optimize brand visibility across LLM platforms or study cross-model consistency in applied settings. A reader who wants raw divergence numbers might pull something useful; anyone who needs to act on the failure-mode claim will need the missing methods details first.

I would send it to peer review so the labeling process can be examined, but the current version rests on an unverified step that directly supports the headline result.

Referee Report

2 major / 2 minor

Summary. The paper reports that ChatGPT and Claude diverge substantially in brand recommendations across 215 prompts (cross-provider Jaccard index 0.35 vs. 0.50-0.61 same-prompt baseline), yet on 7,763 joint failures they agree on the assigned failure mode (discoverability, compellingness, or positioning) 95.1% of the time (clustered 95% CI [94.3%, 95.7%]), with agreement rising to 99.6% for long-tail brands. The authors conclude that providers converge on diagnoses even when their generative routes differ (e.g., prior-based recommendations 43-52% for Anthropic vs. 8-29% for OpenAI).
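
The abstract describes its interval as "clustered" without naming the clustering unit or the estimator. One common way to obtain such an interval is a percentile bootstrap that resamples whole clusters rather than individual failures. The sketch below shows that technique under stated assumptions: `cluster_id` might be a prompt or a brand, the function name and defaults are illustrative, and none of this is confirmed as the authors' method.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for an agreement rate, resampling whole clusters.

    records: iterable of (cluster_id, agree) pairs, where agree is True when
    both providers received the same failure-mode label.
    """
    by_cluster = defaultdict(list)
    for cluster_id, agree in records:
        by_cluster[cluster_id].append(bool(agree))
    clusters = list(by_cluster.values())

    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        # Draw clusters with replacement, keeping each cluster's rows together.
        sample = rng.choices(clusters, k=len(clusters))
        hits = sum(sum(c) for c in sample)
        total = sum(len(c) for c in sample)
        rates.append(hits / total)
    rates.sort()

    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = sum(sum(c) for c in clusters) / sum(len(c) for c in clusters)
    return point, lo, hi


# Toy records, invented for illustration only.
records = [("p1", True), ("p1", True), ("p1", False),
           ("p2", True), ("p2", True), ("p3", True)]
point, lo, hi = cluster_bootstrap_ci(records)
```

Resampling by cluster matters because failures from the same prompt or brand are unlikely to be independent; treating all 7,763 rows as independent would tend to understate the interval width.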

Significance. If the three-mode taxonomy can be applied consistently, the result indicates that long-tail visibility work can be shared across providers while category-leader positioning remains more provider-specific. The analysis rests on direct empirical counts and Jaccard indices from prompt runs rather than fitted parameters or circular derivations.

major comments (2)

[Abstract] The 95.1% agreement figure is obtained only after the authors classify each of the 7,763 joint failures into discoverability/compellingness/positioning. No decision rules, edge-case examples, blinding protocol, or inter-rater reliability statistics are supplied, so it is impossible to assess whether the reported convergence is independent of the labeling step.
[Abstract and presumed Methods] The claim that 'work that addresses the diagnosed failure mode lifts visibility on both providers' is presented as a conclusion, yet the manuscript provides no before/after measurements or controlled interventions demonstrating this lift for the three modes.

minor comments (2)

[Abstract] The same-prompt rerun baseline Jaccard range (0.50-0.61) is cited without stating the number of reruns per prompt or how variance was estimated.
[Abstract] The four measurement batches are mentioned but not characterized (e.g., temporal separation, prompt sampling method).

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and outline revisions to improve transparency and accuracy.

Referee: [Abstract] The 95.1% agreement figure is obtained only after the authors classify each of the 7,763 joint failures into discoverability/compellingness/positioning. No decision rules, edge-case examples, blinding protocol, or inter-rater reliability statistics are supplied, so it is impossible to assess whether the reported convergence is independent of the labeling step.

Authors: We agree the manuscript does not currently supply explicit decision rules, edge-case examples, blinding details, or inter-rater statistics for the failure-mode classification. The three modes were applied using operational definitions in the Methods (discoverability: brand absent from model knowledge; compellingness: known but unmentioned; positioning: mentioned but not recommended). To resolve this, we will add a dedicated subsection with formal decision rules, three annotated edge cases per mode, and Cohen's kappa from a blinded second-rater re-labeling of a 500-failure subsample. This addition will allow independent evaluation of whether the 95.1% agreement is robust to labeling choices.
Revision: yes

Referee: [Abstract and presumed Methods] The claim that 'work that addresses the diagnosed failure mode lifts visibility on both providers' is presented as a conclusion, yet the manuscript provides no before/after measurements or controlled interventions demonstrating this lift for the three modes.

Authors: The statement is an inference drawn from the cross-provider convergence in diagnoses combined with the documented differences in generative routes. No before/after measurements or intervention experiments appear in the manuscript. We will revise the abstract and conclusion to present the claim as a hypothesis for future work rather than a demonstrated result, changing the wording to indicate that such work 'is expected to' or 'may' lift visibility on both providers while explicitly noting the absence of direct empirical tests.
Revision: yes
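
The first response above promises Cohen's kappa from a second-rater re-labeling. Kappa corrects raw agreement for the agreement expected by chance, and it can be near zero even when raw agreement is very high if one label dominates (the situation discussed by Feinstein and Cicchetti, which the paper cites). The sketch below computes both numbers; the label lists are invented solely to show that effect and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return p_o, float("nan")  # kappa is undefined when only one label is used
    return p_o, (p_o - p_e) / (1 - p_e)


# Invented example: 96 shared "discoverability" labels, 4 crossed disagreements.
rater_1 = ["discoverability"] * 98 + ["positioning"] * 2
rater_2 = ["discoverability"] * 96 + ["positioning"] * 2 + ["discoverability"] * 2

p_o, kappa = cohens_kappa(rater_1, rater_2)  # p_o is 0.96, kappa is about -0.02
```

If long-tail failures are overwhelmingly labeled with a single mode, a figure such as 99.6% raw agreement may say little on its own, which is why reporting kappa alongside it would be informative.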

Circularity Check

0 steps flagged

No circularity: results are direct empirical counts from prompt runs and classifications

The paper's central results consist of observed recommendation Jaccard indices (0.35) and a direct proportion of agreement (95.1%) on failure-mode labels assigned to 7,763 joint failures. These are computed from prompt executions and researcher-applied categories with no equations, fitted parameters, or self-citations that reduce the reported figures to inputs by construction. The derivation chain is observational measurement rather than any self-referential or tautological step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the assumption that the three failure modes form an exhaustive and mutually exclusive classification scheme that can be applied consistently to model outputs.

axioms (1)

Domain assumption: the three failure modes (discoverability, compellingness, positioning) are exhaustive and mutually exclusive for cases where neither provider recommends a brand.
Invoked to classify the 7,763 joint failures and compute the 95.1% agreement rate.

invented entities (1)

Discoverability / compellingness / positioning failure-mode taxonomy (no independent evidence)
Purpose: to label why a brand is not recommended by a given provider.
Defined within the paper to enable the cross-provider comparison; no independent evidence outside this study is provided.

pith-pipeline@v0.9.1-grok · 5814 in / 1359 out tokens · 40587 ms · 2026-06-30T14:40:30.556698+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 8 canonical work pages · 2 internal anchors

[1] Jack, W. Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit. Unusual.ai Research Series, 2026a.
[2] Jack, W. Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline. Unusual.ai Research Series, 2026b.
[3] Jack, W. Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit. Unusual.ai Research Series, 2026c.
[4] Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. KDD ’24, 2024. arXiv:2311.09735.
[5] Bommasani, R., Creel, K. A., Kumar, A., Jurafsky, D., Liang, P. Picking on the Same Person: Does Algorithmic Monoculture Lead to Outcome Homogenization? NeurIPS, 2022. arXiv:2211.13972.
[6] Court, D., Elzinga, D., Mulder, S., Vetvik, O. J. The Consumer Decision Journey. McKinsey Quarterly, 2009.
[7] Liang, P. et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research, 2023. arXiv:2211.09110.
[8] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL, 2023.
[9] Cossio, M. A Comprehensive Taxonomy of Hallucinations in Large Language Models. arXiv:2508.01781, 2025.
[10] Tekin, S. F., Ilhan, F., Huang, T., Hu, S., Liu, L. LLM-TOPLA: Efficient LLM Ensemble by Maximizing Diversity. Findings of EMNLP, 2024.
[11] Feinstein, A. R., Cicchetti, D. V. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549, 1990.
[12] Goyal, S., Baek, C., Kolter, J. Z., Raghunathan, A. Context-Parametric Inversion: Why Instruction Finetuning Can Worsen Context Reliance. ICLR, 2025. arXiv:2410.10796.
[13] Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., Xu, W. Knowledge Conflicts for LLMs: A Survey. EMNLP, 2024. arXiv:2403.08319.
[14] Yang, K.-C. News Source Citing Patterns in AI Search Systems. arXiv:2507.05301, 2025.
[15] Zheng, L. et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023. arXiv:2306.05685.
