Assessing the performance of 8 AI chatbots in bibliographic reference retrieval: Grok and DeepSeek outperform ChatGPT, but none are fully accurate
Pith reviewed 2026-05-22 01:54 UTC · model grok-4.3
The pith
Eight AI chatbots produce fully accurate bibliographic references in only 26.5 percent of cases, with Grok and DeepSeek generating no false references.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study shows that none of the eight tested chatbots are fully accurate when generating bibliographic references for academic use. Only 26.5% of references were completely correct, while nearly 40% were either erroneous or entirely made up. Grok and DeepSeek distinguished themselves by not producing any fabricated references at all, although all models exhibited issues such as favoring books over journal articles and sharing similar source suggestions in some cases.
What carries the argument
A five-component evaluation framework that scores each reference on authorship, year, title, source, and location.
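The component-based scoring can be illustrated with a minimal Python sketch. The decision rule here is an assumption made for illustration only: the paper's actual rubric is not reproduced on this page, and the names `ReferenceCheck`, `classify`, and the `max_partial_errors` threshold are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ReferenceCheck:
    """One flag per component; True means it matched a verified record."""
    authorship: bool
    year: bool
    title: bool
    source: bool
    location: bool

    def error_count(self) -> int:
        # Count the components that did not match.
        return sum(not getattr(self, f.name) for f in fields(self))


def classify(check: ReferenceCheck, max_partial_errors: int = 2) -> str:
    # Assumed rule, NOT the authors' rubric:
    #   0 errors                      -> fully correct
    #   1..max_partial_errors errors  -> partially correct
    #   more errors                   -> erroneous/fabricated
    errors = check.error_count()
    if errors == 0:
        return "fully correct"
    if errors <= max_partial_errors:
        return "partially correct"
    return "erroneous/fabricated"
```

For example, a reference whose year alone is wrong would be labeled "partially correct" under this assumed rule, while one with no matching component would be labeled "erroneous/fabricated".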
If this is right
- Students using AI for references must verify outputs to avoid errors in academic work.
- AI developers should focus on reducing fabrication in reference generation tasks.
- Higher education needs to emphasize critical assessment of AI-generated content.
- Performance varies by model, suggesting selection of tools matters for this use case.
Where Pith is reading between the lines
- The issues may extend to other AI factual generation tasks like data citation or literature summaries.
- Using multiple prompts or real student queries might provide a broader view of typical performance.
- Shared overlaps point to common data sources or training limitations across models.
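One way to quantify the overlap in suggested sources is a pairwise Jaccard index over each model's set of references. This is a sketch under stated assumptions: the paper's own overlap measure is not described here, the function names are hypothetical, and the identifiers in `suggestions` are placeholders rather than data from the study.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of items common to both sets, relative to their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(sets_by_model: dict) -> dict:
    """Jaccard index for every unordered pair of models."""
    return {
        (m1, m2): jaccard(sets_by_model[m1], sets_by_model[m2])
        for m1, m2 in combinations(sorted(sets_by_model), 2)
    }


# Placeholder data for illustration only (not from the paper).
suggestions = {
    "model_a": {"ref1", "ref2", "ref3"},
    "model_b": {"ref2", "ref3", "ref4"},
    "model_c": {"ref5"},
}
overlap = pairwise_overlap(suggestions)
```

In practice the references would first need to be normalized (for instance by DOI or by a cleaned title) so that the same work cited in two formats counts as one item.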
Load-bearing premise
The authors' manual judgment of correctness for each reference component is objective and unbiased, and the single standardized prompt used is representative of typical university-level requests for bibliographic references.
What would settle it
Having a second independent team re-evaluate the same set of 400 references for accuracy to confirm the reported percentages.
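If a second team did re-rate the references, agreement with the original labels could be summarized with Cohen's kappa, the statistic the referee report mentions. The sketch below is a plain-Python illustration; the function name is hypothetical, and returning 1.0 when both raters use a single identical category is an assumed convention.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Assumed convention: both raters used one identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value near 1 would indicate that the two sets of judgments agree well beyond chance; a value near 0 would indicate agreement no better than chance.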
Original abstract
This study analyzes the performance of eight generative artificial intelligence chatbots -- ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Le Chat, and Perplexity -- in their free versions, in the task of generating academic bibliographic references within the university context. A total of 400 references were evaluated across the five major areas of knowledge (Health, Engineering, Experimental Sciences, Social Sciences, and Humanities), based on a standardized prompt. Each reference was assessed according to five key components (authorship, year, title, source, and location), along with document type, publication age, and error count. The results show that only 26.5% of the references were fully correct, 33.8% partially correct, and 39.8% were either erroneous or entirely fabricated. Grok and DeepSeek stood out as the only chatbots that did not generate false references, while Copilot, Perplexity, and Claude exhibited the highest hallucination rates. Furthermore, the chatbots showed a greater tendency to generate book references over journal articles, although the latter had a significantly higher fabrication rate. A high degree of overlap was also detected among the sources provided by several models, particularly between DeepSeek, Grok, Gemini, and ChatGPT. These findings reveal structural limitations in current AI models, highlight the risks of uncritical use by students, and underscore the need to strengthen information and critical literacy regarding the use of AI tools in higher education.
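The abstract reports percentages but not raw counts. As a consistency check, the sketch below searches for integer counts out of 400 whose share lies within 0.05 percentage points of each reported figure. The counts it finds are inferred here, not stated by the authors, and the function name is hypothetical.

```python
def counts_matching(pct_tenths: int, total: int) -> list:
    """Counts n whose share n/total is within 0.05 points of the reported percentage.

    pct_tenths is the percentage in tenths (26.5% -> 265), so the comparison
    uses integer arithmetic and avoids floating-point rounding surprises.
    """
    return [
        n for n in range(total + 1)
        if abs(2 * 1000 * n - 2 * pct_tenths * total) <= total
    ]


TOTAL = 400  # sample size stated in the abstract
reported = {
    "fully correct": 265,            # 26.5%
    "partially correct": 338,        # 33.8%
    "erroneous or fabricated": 398,  # 39.8%
}
candidates = {label: counts_matching(p, TOTAL) for label, p in reported.items()}
```

Under this assumption each percentage is compatible with exactly one count (106, 135, and 159 references respectively), and those counts sum to 400. The reported percentages total 100.1 only because each was rounded to one decimal place.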
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper assesses the performance of eight AI chatbots in generating bibliographic references from a standardized prompt. The authors evaluated 400 references across five major knowledge areas, scoring each on five components (authorship, year, title, source, location) and classifying it overall as fully correct (26.5%), partially correct (33.8%), or erroneous/fabricated (39.8%). The standout result is that Grok and DeepSeek were the only chatbots that produced no false references, while Copilot, Perplexity, and Claude had the highest hallucination rates. Other findings include a preference for books over journal articles (with a higher fabrication rate for articles) and overlapping sources among models such as DeepSeek, Grok, Gemini, and ChatGPT. The conclusion stresses the limitations of current AI models and the need for critical literacy in higher education.
Significance. If these results hold, the paper provides important evidence that generative AI chatbots are not yet reliable for academic reference generation: nearly 40% of all outputs were erroneous or fabricated, and no model was fully accurate. This has practical significance for university students and educators who rely on tools like ChatGPT for research tasks. By comparing multiple models and domains, the paper identifies the stronger performers (Grok and DeepSeek) and common issues such as hallucination and source overlap. Its empirical design and sizable sample offer a baseline for future studies on AI accuracy in information retrieval. Credit is due for the systematic component-based evaluation and the clear aggregate statistics.
major comments (2)
- Methods section: No inter-rater reliability statistics (e.g., Cohen's kappa) or detailed scoring rubric are reported for the manual classification of the 400 references into fully correct, partially correct, or erroneous/fabricated categories. This is load-bearing for the central claim, as the assertion that Grok and DeepSeek produced zero false references depends entirely on the consistency and objectivity of these per-component judgments (authorship, year, title, source, location).
- Methods section: The exact wording of the standardized prompt is not provided. Reproducibility and assessment of whether the prompt represents typical university-level requests for bibliographic references therefore cannot be verified, which affects the generalizability of the performance rankings.
minor comments (2)
- Abstract: The total sample size (400) and distribution across knowledge areas could be stated explicitly alongside the aggregate percentages for immediate context.
- Results: Tables or figures presenting model-specific error rates would benefit from clearer definitions of 'false references' to avoid ambiguity in interpreting the zero-hallucination finding for Grok and DeepSeek.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we intend to make to improve methodological transparency and reproducibility.
- Referee: Methods section: No inter-rater reliability statistics (e.g., Cohen's kappa) or detailed scoring rubric are reported for the manual classification of the 400 references into fully correct, partially correct, or erroneous/fabricated categories. This is load-bearing for the central claim, as the assertion that Grok and DeepSeek produced zero false references depends entirely on the consistency and objectivity of these per-component judgments (authorship, year, title, source, location).
  Authors: We thank the referee for highlighting the need for greater methodological detail. The component-level classifications were performed using explicit, verifiable bibliographic standards (e.g., matching against known publication records for authorship, year, title, source, and location). Because the original study design did not employ multiple independent raters, Cohen's kappa is not applicable. In the revised manuscript we will add a comprehensive scoring rubric that defines the precise criteria and decision rules for each component and for the overall categories (fully correct, partially correct, erroneous/fabricated). We will also include an appendix with concrete examples of classified references to demonstrate consistent application of the rubric. These additions will directly support the objectivity of the zero-fabrication result for Grok and DeepSeek.
  revision: partial
- Referee: Methods section: The exact wording of the standardized prompt is not provided. Reproducibility and assessment of whether the prompt represents typical university-level requests for bibliographic references therefore cannot be verified, which affects the generalizability of the performance rankings.
  Authors: We agree that the exact prompt wording is necessary for reproducibility and for readers to judge how closely the task mirrors typical university requests. In the revised manuscript we will include the full, verbatim text of the standardized prompt that was issued to all eight chatbots across the five knowledge areas. This addition will allow direct assessment of the prompt's representativeness and will strengthen the generalizability claims.
  revision: yes
Circularity Check
Pure empirical measurement study with no derivations or self-referential reductions
full rationale
The paper conducts a direct empirical evaluation: a standardized prompt is issued to eight chatbots, 400 generated references are manually scored component-wise (authorship, year, title, source, location) for correctness, and aggregate percentages plus per-model hallucination rates are reported. No equations, fitted parameters, theoretical derivations, or load-bearing self-citations appear; the reported statistics (26.5% fully correct, 39.8% erroneous/fabricated, zero false references for Grok/DeepSeek) are simple tallies of observed outputs against external ground truth. The study is therefore self-contained against external benchmarks and contains no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A single standardized prompt elicits representative bibliographic reference behavior from the tested AI models across knowledge areas.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "only 26.5% of the references were fully correct, 33.8% partially correct, and 39.8% were either erroneous or entirely fabricated"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Grok and DeepSeek stood out as the only chatbots that did not generate false references"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.