Assessing the performance of 8 AI chatbots in bibliographic reference retrieval: Grok and DeepSeek outperform ChatGPT, but none are fully accurate

Álvaro Cabezas-Clavijo; Pavel Sidorenko-Bautista

arxiv: 2505.18059 · v1 · pith:MBFVXNGNnew · submitted 2025-05-23 · 💻 cs.IR

Pith reviewed 2026-05-22 01:54 UTC · model grok-4.3

classification 💻 cs.IR

keywords AI chatbots, bibliographic references, hallucinations, academic integrity, generative AI, reference retrieval, ChatGPT, Grok

The pith

Eight AI chatbots produce fully accurate bibliographic references in only 26.5 percent of cases, with Grok and DeepSeek generating no false references.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests eight AI chatbots on their ability to generate academic bibliographic references using a standardized prompt across five fields of knowledge. It finds that only 26.5% of 400 evaluated references were fully correct, with 33.8% partially correct and 39.8% erroneous or fabricated. Grok and DeepSeek were the only ones that did not generate any false references, while others like Copilot had higher hallucination rates. The models also showed a preference for generating book references over journal articles, and there was overlap in the sources suggested by different models. These findings indicate structural limitations in AI for this task and risks for uncritical use in education.

Core claim

The study shows that none of the eight tested chatbots are fully accurate when generating bibliographic references for academic use. Only 26.5% of references were completely correct, while nearly 40% were either erroneous or entirely made up. Grok and DeepSeek distinguished themselves by not producing any fabricated references at all, although all models exhibited issues such as favoring books over journal articles and sharing similar source suggestions in some cases.
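The three reported percentages can be checked against the stated sample of 400 references. The sketch below is an illustrative consistency check, not part of the paper: it searches for every whole-number count whose share of 400 matches each reported percentage at one-decimal precision. The helper name `counts_consistent_with` is hypothetical, and the resulting counts are inferred from the percentages rather than quoted from the source.

```python
from fractions import Fraction

TOTAL = 400  # references evaluated, as stated in the paper

# Percentages as reported (one decimal place), kept as strings so that
# Fraction can parse them exactly without floating-point error.
REPORTED = {
    "fully correct": "26.5",
    "partially correct": "33.8",
    "erroneous or fabricated": "39.8",
}


def counts_consistent_with(pct: str, total: int) -> list[int]:
    """Return every integer count whose share of `total` lies within
    half a reporting step (0.05 points) of the reported percentage."""
    target = Fraction(pct)
    half_step = Fraction(1, 20)
    return [
        n for n in range(total + 1)
        if abs(Fraction(100 * n, total) - target) <= half_step
    ]


# Inferred counts (an inference from the percentages, not source data).
implied = {label: counts_consistent_with(pct, TOTAL)
           for label, pct in REPORTED.items()}
for label, candidates in implied.items():
    print(label, candidates)

# Each category has exactly one candidate, so the counts can be summed.
total_implied = sum(candidates[0] for candidates in implied.values())
print("sum of implied counts:", total_implied)
```

Exact fractions are used instead of floats so that borderline values such as 33.75 are not misjudged by binary rounding. The three reported percentages add up to 100.1 rather than 100, which is what one-decimal rounding of the individual shares would produce.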

What carries the argument

A five-component evaluation framework that scores each reference on authorship, year, title, source, and location.
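A component-wise scoring scheme of this kind might be represented as follows. This is a minimal sketch under stated assumptions, not the authors' rubric, which the page does not reproduce: the `traceable` field, the `max_partial_errors` threshold, the `categorize` rule, and the chatbot names `bot_a` and `bot_b` are all hypothetical, and the sample records are invented purely to exercise the code.

```python
from collections import Counter
from dataclasses import dataclass

# The five components named in the paper.
COMPONENTS = ("authorship", "year", "title", "source", "location")


@dataclass
class ScoredReference:
    chatbot: str
    traceable: bool            # hypothetical: could the cited work be located at all?
    correct: dict[str, bool]   # one manual judgment per component

    def error_count(self) -> int:
        """Number of components judged incorrect."""
        return sum(not self.correct[name] for name in COMPONENTS)


def categorize(ref: ScoredReference, max_partial_errors: int = 2) -> str:
    """ASSUMED decision rule (the paper's actual thresholds are not given here)."""
    if not ref.traceable:
        return "erroneous/fabricated"
    errors = ref.error_count()
    if errors == 0:
        return "fully correct"
    if errors <= max_partial_errors:
        return "partially correct"
    return "erroneous/fabricated"


def all_correct_except(*wrong: str) -> dict[str, bool]:
    """Build a judgment dict with the listed components marked incorrect."""
    return {name: name not in wrong for name in COMPONENTS}


# Invented records for illustration only; these are not study data.
sample = [
    ScoredReference("bot_a", True, all_correct_except()),
    ScoredReference("bot_a", True, all_correct_except("year")),
    ScoredReference("bot_b", True, all_correct_except("year", "title", "location")),
    ScoredReference("bot_b", False, all_correct_except(*COMPONENTS)),
]

# Tally categories per chatbot, the shape of a per-model results table.
tally = Counter((ref.chatbot, categorize(ref)) for ref in sample)
for (chatbot, category), count in sorted(tally.items()):
    print(chatbot, category, count)
```

Keeping the per-component judgments separate from the category rule means the threshold can be changed and the tallies recomputed without re-scoring any reference.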

If this is right

Students using AI for references must verify outputs to avoid errors in academic work.
AI developers should focus on reducing fabrication in reference generation tasks.
Higher education needs to emphasize critical assessment of AI-generated content.
Performance varies by model, suggesting selection of tools matters for this use case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The issues may extend to other AI factual generation tasks like data citation or literature summaries.
Using multiple prompts or real student queries might provide a broader view of typical performance.
Shared overlaps point to common data sources or training limitations across models.

Load-bearing premise

The authors' manual judgment of correctness for each reference component is objective and unbiased, and the single standardized prompt used is representative of typical university-level requests for bibliographic references.

What would settle it

Having a second independent team re-evaluate the same set of 400 references for accuracy to confirm the reported percentages.

Original abstract

This study analyzes the performance of eight generative artificial intelligence chatbots -- ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Le Chat, and Perplexity -- in their free versions, in the task of generating academic bibliographic references within the university context. A total of 400 references were evaluated across the five major areas of knowledge (Health, Engineering, Experimental Sciences, Social Sciences, and Humanities), based on a standardized prompt. Each reference was assessed according to five key components (authorship, year, title, source, and location), along with document type, publication age, and error count. The results show that only 26.5% of the references were fully correct, 33.8% partially correct, and 39.8% were either erroneous or entirely fabricated. Grok and DeepSeek stood out as the only chatbots that did not generate false references, while Copilot, Perplexity, and Claude exhibited the highest hallucination rates. Furthermore, the chatbots showed a greater tendency to generate book references over journal articles, although the latter had a significantly higher fabrication rate. A high degree of overlap was also detected among the sources provided by several models, particularly between DeepSeek, Grok, Gemini, and ChatGPT. These findings reveal structural limitations in current AI models, highlight the risks of uncritical use by students, and underscore the need to strengthen information and critical literacy regarding the use of AI tools in higher education.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper assesses the performance of eight AI chatbots in generating bibliographic references from a standardized prompt. The authors evaluated 400 references across five major knowledge areas, scoring each on five components (authorship, year, title, source, location) and classifying it overall as fully correct (26.5%), partially correct (33.8%), or erroneous/fabricated (39.8%). The standout result is that Grok and DeepSeek were the only chatbots that produced no false references, while Copilot, Perplexity, and Claude had the highest hallucination rates. Other findings include a preference for books over journal articles (with a higher fabrication rate among articles) and overlap in the sources suggested by DeepSeek, Grok, Gemini, and ChatGPT. The conclusion stresses the limitations of current AI models and the need for critical literacy in higher education.

Significance. If these results hold, the paper provides important evidence that generative AI chatbots are not yet reliable for academic reference generation: nearly 40% of all outputs were erroneous or fabricated, and even the models that produced no fabricated references were not fully accurate. This has practical significance for university students and educators who rely on tools like ChatGPT for research tasks. By comparing multiple models and domains, the study identifies the relatively stronger performers (Grok and DeepSeek) and common issues such as hallucination and source overlap. Its empirical design and sizable sample offer a baseline for future studies on AI accuracy in information retrieval. Credit is due for the systematic component-based evaluation and the clearly reported aggregate statistics.

major comments (2)

Methods section: No inter-rater reliability statistics (e.g., Cohen's kappa) or detailed scoring rubric are reported for the manual classification of the 400 references into fully correct, partially correct, or erroneous/fabricated categories. This is load-bearing for the central claim, as the assertion that Grok and DeepSeek produced zero false references depends entirely on the consistency and objectivity of these per-component judgments (authorship, year, title, source, location).
Methods section: The exact wording of the standardized prompt is not provided. Reproducibility and assessment of whether the prompt represents typical university-level requests for bibliographic references therefore cannot be verified, which affects the generalizability of the performance rankings.
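For reference, the inter-rater statistic named in the first major comment, Cohen's kappa, compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below shows the standard two-rater calculation; the function name `cohens_kappa` and the label lists are illustrative assumptions, and the labels are invented rather than taken from the study, which used no second rater.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label proportions,
    # summed over all labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels for ten references; not data from the paper.
first_rater = ["full", "full", "partial", "false", "partial",
               "false", "full", "partial", "false", "false"]
second_rater = ["full", "partial", "partial", "false", "partial",
                "false", "full", "false", "false", "false"]

kappa = cohens_kappa(first_rater, second_rater)
print(f"kappa = {kappa:.3f}")
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is why the statistic is more informative than raw percentage agreement when one category dominates.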

minor comments (2)

Abstract: The total sample size (400) and distribution across knowledge areas could be stated explicitly alongside the aggregate percentages for immediate context.
Results: Tables or figures presenting model-specific error rates would benefit from clearer definitions of 'false references' to avoid ambiguity in interpreting the zero-hallucination finding for Grok and DeepSeek.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we intend to make to improve methodological transparency and reproducibility.

Referee: Methods section: No inter-rater reliability statistics (e.g., Cohen's kappa) or detailed scoring rubric are reported for the manual classification of the 400 references into fully correct, partially correct, or erroneous/fabricated categories. This is load-bearing for the central claim, as the assertion that Grok and DeepSeek produced zero false references depends entirely on the consistency and objectivity of these per-component judgments (authorship, year, title, source, location).

Authors: We thank the referee for highlighting the need for greater methodological detail. The component-level classifications were performed using explicit, verifiable bibliographic standards (e.g., matching against known publication records for authorship, year, title, source, and location). Because the original study design did not employ multiple independent raters, Cohen's kappa is not applicable. In the revised manuscript we will add a comprehensive scoring rubric that defines the precise criteria and decision rules for each component and for the overall categories (fully correct, partially correct, erroneous/fabricated). We will also include an appendix with concrete examples of classified references to demonstrate consistent application of the rubric. These additions will directly support the objectivity of the zero-fabrication result for Grok and DeepSeek.
revision: partial

Referee: Methods section: The exact wording of the standardized prompt is not provided. Reproducibility and assessment of whether the prompt represents typical university-level requests for bibliographic references therefore cannot be verified, which affects the generalizability of the performance rankings.

Authors: We agree that the exact prompt wording is necessary for reproducibility and for readers to judge how closely the task mirrors typical university requests. In the revised manuscript we will include the full, verbatim text of the standardized prompt that was issued to all eight chatbots across the five knowledge areas. This addition will allow direct assessment of the prompt's representativeness and will strengthen the generalizability claims.
revision: yes

Circularity Check

0 steps flagged

Pure empirical measurement study with no derivations or self-referential reductions

The paper conducts a direct empirical evaluation: a standardized prompt is issued to eight chatbots, 400 generated references are manually scored component-wise (authorship, year, title, source, location) for correctness, and aggregate percentages plus per-model hallucination rates are reported. No equations, fitted parameters, theoretical derivations, or load-bearing self-citations appear; the reported statistics (26.5% fully correct, 39.8% erroneous/fabricated, zero false references for Grok/DeepSeek) are simple tallies of observed outputs against external ground truth. The study is therefore self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The performance claims rest on the representativeness of one standardized prompt and the consistency of author-led correctness judgments across 400 references; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: A single standardized prompt elicits representative bibliographic reference behavior from the tested AI models across knowledge areas.
All evaluations used the same prompt; variation in real user queries is not tested.

pith-pipeline@v0.9.0 · 5818 in / 1255 out tokens · 60874 ms · 2026-05-22T01:54:33.563173+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "only 26.5% of the references were fully correct, 33.8% partially correct, and 39.8% were either erroneous or entirely fabricated"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Grok and DeepSeek stood out as the only chatbots that did not generate false references"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.