Divergent Recommendations, Convergent Diagnoses: Cross-Provider Failure-Mode Convergence in AI Commercial Recommendation
Pith reviewed 2026-06-30 14:40 UTC · model grok-4.3
The pith
ChatGPT and Claude disagree on which brands to recommend two-thirds of the time but agree on the failure reason 95.1 percent of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 215 commercially framed prompts run in four batches, the two providers share only about a third of their recommended brands (cross-provider Jaccard 0.35, below the 0.50-0.61 same-prompt rerun baseline). On the 7,763 occasions when neither provider recommends a given brand, classifying each provider's failure separately as discoverability, compellingness, or positioning yields the same label 95.1 percent of the time. Agreement rises monotonically as brand prominence falls, from 81 percent on category leaders to 99.6 percent on long-tail brands. The providers reach recommendations through measurably different generative routes yet converge on the same diagnostic label when a brand is missed.
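The paper's abstract reports recommendation overlap as a Jaccard index. A minimal sketch of that calculation for one prompt, assuming each provider's recommendations are available as a set of brand names; the function name and the sample brands are hypothetical and are not data from the paper.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical recommendation sets for a single prompt (illustrative only).
chatgpt_recs = {"BrandA", "BrandB", "BrandC"}
claude_recs = {"BrandB", "BrandD"}

# One shared brand out of four distinct brands.
overlap = jaccard(chatgpt_recs, claude_recs)
print(overlap)  # 0.25
```

Averaging this value over prompts would give a cross-provider figure; applying the same function to two reruns of one provider on the same prompt would give a rerun baseline. How the paper aggregates across prompts and batches is not stated in the abstract.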
What carries the argument
Three failure-mode categories—discoverability (brand never reaches the model), compellingness (brand reaches the model but is not mentioned), and positioning (brand is mentioned but not recommended)—applied to every joint non-recommendation.
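The three definitions form a simple decision cascade. A minimal sketch, assuming each brand observation has already been reduced to three boolean flags; how "reaches the model" is actually measured is not described in the abstract, so the flag names here are hypothetical.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    DISCOVERABILITY = "discoverability"  # brand never reaches the model
    COMPELLINGNESS = "compellingness"    # reaches the model, not mentioned
    POSITIONING = "positioning"          # mentioned, not recommended


def classify_failure(reached_model: bool, mentioned: bool,
                     recommended: bool) -> Optional[FailureMode]:
    """Return the failure mode for one provider, or None if recommended."""
    if recommended:
        return None
    if mentioned:
        return FailureMode.POSITIONING
    if reached_model:
        return FailureMode.COMPELLINGNESS
    return FailureMode.DISCOVERABILITY


# Hypothetical joint failure: both providers miss the brand, for different reasons.
provider_a = classify_failure(reached_model=True, mentioned=False, recommended=False)
provider_b = classify_failure(reached_model=True, mentioned=True, recommended=False)
same_diagnosis = provider_a == provider_b
print(provider_a.value, provider_b.value, same_diagnosis)
```

The order of the checks makes the three labels mutually exclusive and exhaustive for any non-recommended brand, which is the property the axiom ledger further down flags as an assumption.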
If this is right
- Fixes that target a diagnosed failure mode raise brand visibility on both providers simultaneously.
- A single optimization playbook suffices for long-tail regional brands.
- Category-leader brands require provider-specific work on positioning and content.
- The convergence on failure diagnosis occurs even though the providers generate recommendations from different internal routes.
Where Pith is reading between the lines
- The shared diagnostic categories may reflect common patterns in how large language models encode commercial knowledge.
- Firms selling into the long tail could prioritize one set of content changes rather than maintaining separate roadmaps.
- Future work could test whether the same three categories apply when more than two providers are compared.
Load-bearing premise
The three failure-mode labels can be assigned to each omitted brand in a way that does not depend on the individual researcher or prompt wording.
What would settle it
Independent coders classifying the same set of 7,763 joint failures and obtaining agreement below 85 percent would indicate that the reported 95.1 percent convergence rests on subjective labeling.
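A check of this kind can be sketched as raw percent agreement plus Cohen's kappa between two coders. The labels below are made up for illustration and the function names are hypothetical. Kappa is included because raw agreement can look high when one label dominates, the paradox discussed by Feinstein and Cicchetti in the paper's reference list.

```python
import math
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Share of items on which the two coders assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    # Expected agreement if each coder labelled independently at their own base rates.
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return math.nan  # both coders used one identical label; kappa is undefined
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two coders on ten joint failures.
coder_1 = ["discoverability"] * 8 + ["positioning"] * 2
coder_2 = ["discoverability"] * 8 + ["positioning", "discoverability"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

In this toy sample the coders agree on nine of ten items, yet kappa is noticeably lower because "discoverability" dominates both label sets. A threshold on raw agreement alone, such as the 85 percent mentioned above, would therefore be worth reading alongside kappa.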
Original abstract
A brand whose customers use both ChatGPT and Claude for product recommendations faces a strategic choice: a single optimization playbook, or one per provider? Across 215 commercially-framed prompts in four measurement batches, the two providers disagree on which brands they recommend roughly two-thirds of the time (cross-provider recommendation Jaccard 0.35, below the 0.50-0.61 same-prompt rerun baseline). The picks diverge. But when neither provider recommends a brand, we classify the failure into one of three modes -- discoverability (the brand never reaches the model), compellingness (it reaches the model but isn't mentioned), or positioning (it's mentioned but not recommended) -- and on 7,763 such joint failures, both providers diagnose the same failure mode 95.1% of the time (clustered 95% CI [94.3%, 95.7%]). Agreement rises monotonically with falling brand prominence, from 81% [78.2%, 84.0%] on category leaders to 99.6% [99.3%, 99.9%] on long-tail regional brands. The two providers reach their picks by measurably different generative routes -- Anthropic recommends from priors 43-52% of the time, OpenAI 8-29% -- but they converge on the failure diagnosis where it matters most for the long tail. Work that addresses the diagnosed failure mode lifts visibility on both providers; positioning- and content-level work for category leaders is more provider-specific.
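The abstract does not say how the clustered confidence interval was computed. One common approach is a percentile cluster bootstrap, sketched below under two loud assumptions: that the clustering unit is the prompt, and that whole clusters are resampled with replacement. The data, the function name, and the parameters are hypothetical.

```python
import random


def cluster_bootstrap_ci(clusters: dict[str, list[int]], n_boot: int = 2000,
                         alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile CI for an agreement rate, resampling whole clusters.

    clusters maps a cluster id (assumed here: a prompt) to 0/1 indicators,
    one per joint failure, where 1 means both providers gave the same label.
    """
    rng = random.Random(seed)
    ids = list(clusters)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(ids, k=len(ids))  # draw clusters with replacement
        agree = sum(sum(clusters[c]) for c in sample)
        total = sum(len(clusters[c]) for c in sample)
        estimates.append(agree / total)
    estimates.sort()
    low = estimates[int((alpha / 2) * n_boot)]
    high = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Hypothetical agreement indicators grouped by prompt (not data from the paper).
clusters = {
    "p1": [1, 1, 1, 0],
    "p2": [1, 1],
    "p3": [1, 0, 1, 1, 1],
    "p4": [1, 1, 1],
}
point = sum(map(sum, clusters.values())) / sum(map(len, clusters.values()))
low, high = cluster_bootstrap_ci(clusters)
print(f"point={point:.3f} ci=[{low:.3f}, {high:.3f}]")
```

Resampling prompts rather than individual failures keeps correlated failures from the same prompt together, which is the usual reason a clustered interval is wider than a naive binomial one.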
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports that ChatGPT and Claude diverge substantially in brand recommendations across 215 prompts (cross-provider Jaccard index 0.35 vs. 0.50-0.61 same-prompt baseline), yet on 7,763 joint failures they agree on the assigned failure mode (discoverability, compellingness, or positioning) 95.1% of the time (clustered 95% CI [94.3%, 95.7%]), with agreement rising to 99.6% for long-tail brands. The authors conclude that providers converge on diagnoses even when their generative routes differ (e.g., prior-based recommendations 43-52% for Anthropic vs. 8-29% for OpenAI).
Significance. If the three-mode taxonomy can be applied consistently, the result indicates that long-tail visibility work can be shared across providers while category-leader positioning remains more provider-specific. The analysis rests on direct empirical counts and Jaccard indices from prompt runs rather than fitted parameters or circular derivations.
major comments (2)
- [Abstract] Abstract: the 95.1% agreement figure is obtained only after the authors classify each of the 7,763 joint failures into discoverability/compellingness/positioning. No decision rules, edge-case examples, blinding protocol, or inter-rater reliability statistics are supplied, so it is impossible to assess whether the reported convergence is independent of the labeling step.
- [Abstract] Abstract and presumed Methods: the claim that 'work that addresses the diagnosed failure mode lifts visibility on both providers' is presented as a conclusion, yet the manuscript provides no before/after measurements or controlled interventions demonstrating this lift for the three modes.
minor comments (2)
- [Abstract] Abstract: the same-prompt rerun baseline Jaccard range (0.50-0.61) is cited without stating the number of reruns per prompt or how variance was estimated.
- [Abstract] Abstract: the four measurement batches are mentioned but not characterized (e.g., temporal separation, prompt sampling method).
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and outline revisions to improve transparency and accuracy.
- Referee: [Abstract] Abstract: the 95.1% agreement figure is obtained only after the authors classify each of the 7,763 joint failures into discoverability/compellingness/positioning. No decision rules, edge-case examples, blinding protocol, or inter-rater reliability statistics are supplied, so it is impossible to assess whether the reported convergence is independent of the labeling step.
Authors: We agree the manuscript does not currently supply explicit decision rules, edge-case examples, blinding details, or inter-rater statistics for the failure-mode classification. The three modes were applied using operational definitions in the Methods (discoverability: brand absent from model knowledge; compellingness: known but unmentioned; positioning: mentioned but not recommended). To resolve this, we will add a dedicated subsection with formal decision rules, three annotated edge cases per mode, and Cohen's kappa from a blinded second-rater re-labeling of a 500-failure subsample. This addition will allow independent evaluation of whether the 95.1% agreement is robust to labeling choices.
Revision: yes
- Referee: [Abstract] Abstract and presumed Methods: the claim that 'work that addresses the diagnosed failure mode lifts visibility on both providers' is presented as a conclusion, yet the manuscript provides no before/after measurements or controlled interventions demonstrating this lift for the three modes.
Authors: The statement is an inference drawn from the cross-provider convergence in diagnoses combined with the documented differences in generative routes. No before/after measurements or intervention experiments appear in the manuscript. We will revise the abstract and conclusion to present the claim as a hypothesis for future work rather than a demonstrated result, changing the wording to indicate that such work 'is expected to' or 'may' lift visibility on both providers while explicitly noting the absence of direct empirical tests.
Revision: yes
Circularity Check
No circularity: results are direct empirical counts from prompt runs and classifications
The paper's central results consist of observed recommendation Jaccard indices (0.35) and a direct proportion of agreement (95.1%) on failure-mode labels assigned to 7,763 joint failures. These are computed from prompt executions and researcher-applied categories with no equations, fitted parameters, or self-citations that reduce the reported figures to inputs by construction. The derivation chain is observational measurement rather than any self-referential or tautological step.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the three failure modes (discoverability, compellingness, positioning) are exhaustive and mutually exclusive for cases where neither provider recommends a brand.
invented entities (1)
- Discoverability / compellingness / positioning failure-mode taxonomy (no independent evidence)
Reference graph
Works this paper leans on
- [1] Jack, W. Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit. Unusual.ai Research Series, 2026a.
- [2] Jack, W. Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline. Unusual.ai Research Series, 2026b.
- [3] Jack, W. Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit. Unusual.ai Research Series, 2026c.
- [4] Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. KDD ’24, 2024. arXiv:2311.09735.
- [5] Bommasani, R., Creel, K. A., Kumar, A., Jurafsky, D., Liang, P. Picking on the Same Person: Does Algorithmic Monoculture Lead to Outcome Homogenization? NeurIPS, 2022. arXiv:2211.13972.
- [6] Court, D., Elzinga, D., Mulder, S., Vetvik, O. J. The Consumer Decision Journey. McKinsey Quarterly, 2009.
- [7] Liang, P. et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research, 2023. arXiv:2211.09110.
- [8] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL, 2023.
- [9] Cossio, M. A Comprehensive Taxonomy of Hallucinations in Large Language Models. arXiv:2508.01781, 2025.
- [10] Tekin, S. F., Ilhan, F., Huang, T., Hu, S., Liu, L. LLM-TOPLA: Efficient LLM Ensemble by Maximizing Diversity. Findings of EMNLP, 2024.
- [11] Feinstein, A. R., Cicchetti, D. V. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549, 1990.
- [12] Goyal, S., Baek, C., Kolter, J. Z., Raghunathan, A. Context-Parametric Inversion: Why Instruction Finetuning Can Worsen Context Reliance. ICLR, 2025. arXiv:2410.10796.
- [13] Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., Xu, W. Knowledge Conflicts for LLMs: A Survey. EMNLP, 2024. arXiv:2403.08319.
- [14] Yang, K.-C. News Source Citing Patterns in AI Search Systems. arXiv:2507.05301, 2025.
- [15] Zheng, L. et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023. arXiv:2306.05685.