arxiv: 2605.04171 · v1 · submitted 2026-05-05 · 💻 cs.CL

Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

Pith reviewed 2026-05-08 17:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords hallucinations · large language models · academic writing · reference generation · factuality · Hallucination Index · ChatGPT · Grok

The pith

Hallucinations in large language models for academic writing vary by task and prompting conditions rather than model architecture alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four large language models on academic writing by feeding them 80 prompts split into reference generation, factual explanation, abstract generation, and writing improvement. It scores the outputs with a 0-5 rubric covering factual accuracy, reference validity, coherence, style consistency, and academic tone, then combines those scores into a new weighted Hallucination Index. The results show Grok and Copilot produce fewer hallucinations on reference tasks while Gemini and ChatGPT maintain better tone but introduce more factual errors. Overall the work establishes that hallucination rates shift with the kind of academic task and the exact prompt given, not only with which model is used.

Core claim

By testing ChatGPT, Grok, Gemini, and Copilot on four categories of academic prompts and scoring outputs with a new Hallucination Index, the study demonstrates that no single model is consistently superior; instead, performance and hallucination rates shift with the nature of the writing task and the prompt conditions provided.

What carries the argument

The Hallucination Index, a novel weighted metric derived from 0-5 rubric scores on factual accuracy, reference validity, coherence, style consistency, and academic tone.
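
As a minimal sketch of how such a weighted index could be computed: the function below takes the five rubric scores (each 0-5), applies per-dimension weights, and normalizes the result to the 0-1 range. The weight values, the dictionary keys, and the function name `hallucination_index` are illustrative assumptions; the paper does not report its weights or its exact formula. The sketch also assumes that higher values mean better quality, which seems to match the abstract, where lower HI scores accompany higher hallucination risk.

```python
from typing import Dict

# Hypothetical weights: the paper does not state the values it used.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "factual_accuracy": 0.30,
    "reference_validity": 0.30,
    "coherence": 0.15,
    "style_consistency": 0.15,
    "academic_tone": 0.10,
}

MAX_SCORE = 5.0  # top of the 0-5 rubric


def hallucination_index(
    scores: Dict[str, float],
    weights: Dict[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Return a weighted rubric score normalized to [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= MAX_SCORE:
            raise ValueError(f"{name} score {value} is outside the 0-5 rubric")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    # Dividing by the maximum attainable weighted sum maps the result to [0, 1].
    return weighted_sum / (MAX_SCORE * total_weight)


if __name__ == "__main__":
    # Made-up rubric scores for a single model output.
    example = {
        "factual_accuracy": 4,
        "reference_validity": 4,
        "coherence": 3,
        "style_consistency": 3,
        "academic_tone": 5,
    }
    print(round(hallucination_index(example), 3))  # 0.76
```

Normalizing by the weight total keeps the index comparable even if the weights do not sum to one, which matters when alternative weightings are compared.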

If this is right

  • Grok and Copilot perform better on reference generation while Gemini and ChatGPT show stronger tone control.
  • Abstract and stylistic prompts increase hallucination risk across all four models.
  • Hallucination rates change with both task category and specific prompting conditions.
  • The Hallucination Index offers a quantitative tool for comparing model reliability on academic tasks.
  • Future work can explore prompt designs that reduce hallucinations in particular academic subtasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Users may need to match specific models to the academic subtask rather than choosing one model for all writing needs.
  • Prompt engineering techniques could be developed to compensate for each model's observed weaknesses.
  • The same task-and-prompt dependence might appear in non-academic domains such as technical documentation or legal drafting.
  • Developers could incorporate task-type signals into model training to reduce hallucinations more effectively than architecture changes alone.

Load-bearing premise

The 0-5 rubric scores and the derived Hallucination Index accurately reflect true hallucinations without significant bias from the evaluators or the chosen prompts.

What would settle it

Re-scoring the same model outputs with independent academic experts or applying the rubric to outputs from a wider set of models and prompts to check whether the reported task-dependent HI patterns remain stable.

Figures

Figures reproduced from arXiv: 2605.04171 by Aqeel Khalique, Humam Khan, Md Tabrez Nafis, Rehan Hasan Khan, Shahab Saquib Sohail.

Figure 3
Figure 3: Heatmap - HI (%) by Task Category. Each cell gives F_d / R / HI%.

Model    Reference Generation   Factual Explanation   Abstract Generation   Writing Improvement
Grok     4.4 / 4.5 / 78.0       4.1 / 3.9 / 66.09     2.4 / 4.4 / 61.73     4.9 / 4.6 / 77.09
Copilot  4.3 / 4.2 / 77.5       4.7 / 5.0 / 88.37     3.5 / 2.8 / 60.73     1.8 / 3.3 / 57.73
Gemini   3.9 / 3.7 / 74.0       2.8 / 2.1 / 55.18     1.4 / 2.7 / 41.27     3.5 / 2.0 / …
Figure 4
Figure 4: Frequency of Low-Quality Scores (Score < 3) per Model

Original abstract

Large Language models (LLMs) show extraordinary abilities, but they are still prone to hallucinations, especially when we use them for generating Academic content. We have investigated four popular LLMs, ChatGPT, Grok, Gemini, and Copilot for hallucinations specifically for academic writing. We have designed 80 prompts across four categories, namely, reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the model using a 0-5 rubric score, which checks factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, Hallucination Index (HI), was introduced to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to check errors which alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks, but they often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Whereas, Gemini and ChatGPT have done well with having stronger tone control, but they lack in writing factual tasks and higher hallucination risk with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior does not depend solely on model architecture but also on the type of task and the prompting conditions we are providing. We propose that our work opens new research dimensions for future researchers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper investigates hallucinations in four LLMs (ChatGPT, Grok, Gemini, Copilot) for academic writing tasks. It uses 80 author-designed prompts across four categories (reference generation, factual explanation, abstract generation, writing improvement), evaluates outputs via a 0-5 rubric on factual accuracy, reference validity, coherence, style, and tone, and introduces a novel weighted Hallucination Index (HI) to quantify hallucinations. Results show model-specific patterns (e.g., Grok and Copilot, with HI 0.67 and 0.70, do better on reference tasks but struggle with abstract or stylistic prompts; Gemini and ChatGPT, with HI 0.53 and 0.57, show better tone control but factual weaknesses), leading to the claim that hallucination depends on task type and prompting conditions, not solely on model architecture.

Significance. If the rubric and HI are shown to be reliable and bias-free, the work would be significant for NLP and AI-assisted scholarship by demonstrating that hallucination rates are task- and prompt-dependent. This could shift focus from model selection alone to prompt engineering and task-specific safeguards in academic applications, opening research on mitigation strategies tailored to writing subtasks.

major comments (3)
  1. [Abstract/Methods] The 0-5 rubric and its aggregation into the weighted Hallucination Index are central to all reported HI values and the task-dependence claim, yet no details are provided on rubric application criteria, number of evaluators, inter-rater agreement statistics, or how disagreements were resolved. Without these, observed HI spreads (e.g., 0.67 vs. 0.53) cannot be distinguished from evaluator subjectivity.
  2. [Abstract/Results] The weights used in the novel Hallucination Index are unspecified (free parameters), with no ablation, correlation to established hallucination metrics (e.g., factuality benchmarks), or external validation against human fact-checking. This directly undermines attribution of HI differences to task/prompt factors rather than metric construction.
  3. [Results] No statistical tests, confidence intervals, or analysis of prompt sampling/balancing are reported for the 80 prompts or HI comparisons across the four categories. This leaves open whether differences reflect genuine task dependence or selection bias in the author-designed prompts.
minor comments (1)
  1. [Abstract] The sentence on evaluation metrics failing for sentiment-altering errors in machine-translated text is out of place and unrelated to the LLM academic-writing focus; it should be removed or replaced with a relevant limitation statement.
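
The first major comment turns on inter-rater agreement. As a minimal sketch of what such a statistic might look like, the block below computes unweighted Cohen's kappa for two raters scoring the same outputs on the 0-5 rubric. The function name `cohens_kappa` and the example scores are hypothetical; the paper reports no agreement data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal score frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical score for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up rubric scores from two hypothetical evaluators.
    evaluator_1 = [5, 4, 3, 3, 2]
    evaluator_2 = [5, 4, 3, 2, 2]
    print(round(cohens_kappa(evaluator_1, evaluator_2), 3))  # 0.737
```

Because rubric scores are ordinal, a weighted variant of kappa that penalizes a 5-versus-0 disagreement more than a 5-versus-4 one may be the better choice in practice; the unweighted form is shown only because it is the simplest to follow.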

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment point by point below, indicating where revisions will be made to improve transparency and rigor.

  1. Referee: [Abstract/Methods] The 0-5 rubric and its aggregation into the weighted Hallucination Index are central to all reported HI values and the task-dependence claim, yet no details are provided on rubric application criteria, number of evaluators, inter-rater agreement statistics, or how disagreements were resolved. Without these, observed HI spreads (e.g., 0.67 vs. 0.53) cannot be distinguished from evaluator subjectivity.

    Authors: We acknowledge the need for greater transparency in the evaluation process. The 0-5 rubric was defined with explicit criteria for each dimension (e.g., factual accuracy scored 0 for major fabrications to 5 for fully verifiable content; reference validity from invalid citations to fully accurate). All 80 outputs were scored by the lead author for consistency. We will revise the Methods section to include the full rubric table with scoring anchors and explicitly state the single-evaluator design. This makes the process reproducible and allows readers to evaluate subjectivity. revision: yes

  2. Referee: [Abstract/Results] The weights used in the novel Hallucination Index are unspecified (free parameters), with no ablation, correlation to established hallucination metrics (e.g., factuality benchmarks), or external validation against human fact-checking. This directly undermines attribution of HI differences to task/prompt factors rather than metric construction.

    Authors: The HI is a weighted sum of the five rubric dimensions, with weights reflecting their relative priority in academic writing (factual accuracy and reference validity receiving higher emphasis). We will add the exact weights and their justification to the revised manuscript. We did not perform ablation or external validation in the original study, as the focus was on introducing the index for this task set; we will note this as a limitation and suggest it for future work rather than claiming robustness beyond the current scope. revision: partial

  3. Referee: [Results] No statistical tests, confidence intervals, or analysis of prompt sampling/balancing are reported for the 80 prompts or HI comparisons across the four categories. This leaves open whether differences reflect genuine task dependence or selection bias in the author-designed prompts.

    Authors: We agree this is a limitation of the current presentation. The 80 prompts were manually designed with 20 per category to represent common academic writing scenarios. In the revision we will report per-category means and standard deviations for HI scores and add a note on the author-designed, non-random prompt selection. Formal hypothesis testing is not feasible without a larger, independently sampled prompt corpus, but the consistent model-by-task patterns across the four categories still support our preliminary claim of task dependence. revision: yes
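
One lightweight way to attach uncertainty to a per-category mean is a percentile bootstrap, sketched below under stated assumptions. The function name `bootstrap_mean_ci`, its parameters, and the sample HI values are hypothetical; none of them come from the paper, which reports no confidence intervals. A bootstrap interval describes sampling variability only and does not correct for the non-random, author-designed prompt selection.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap."""
    if len(values) == 0:
        raise ValueError("need at least one value")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(values), resampled_means[lower_idx], resampled_means[upper_idx]


if __name__ == "__main__":
    # Made-up HI values for the prompts of one task category.
    hi_scores = [0.55, 0.60, 0.70, 0.80, 0.65]
    point, low, high = bootstrap_mean_ci(hi_scores)
    print(f"mean={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals for two task categories overlap heavily, the difference between their mean HI values is weak evidence of task dependence at this sample size.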

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivation chain

The paper performs an empirical study: 80 author-designed prompts in four categories are fed to four LLMs, outputs are scored on a 0-5 rubric across five dimensions, and a weighted Hallucination Index is computed from those scores. No equations, first-principles derivations, fitted parameters, or uniqueness theorems are claimed. The central finding (hallucination rates vary by task and prompt type) is a direct summary of the observed score differences; it does not reduce to any self-referential definition or input by construction. The novel HI metric is introduced without any reported fitting or self-citation that would make later results tautological. This is a standard empirical measurement study whose conclusions stand or fall on the quality of the rubric and prompt set, not on any circular logical step.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the assumption that the custom rubric and weighted HI metric validly quantify hallucinations; the metric itself is introduced without external benchmarks or validation data.

free parameters (1)
  • Weights for Hallucination Index
    HI is described as a novel weighted metric combining rubric dimensions; specific weights are not stated but must be chosen or fitted to produce the reported scores.
invented entities (1)
  • Hallucination Index (HI) · no independent evidence
    purpose: Composite score to quantify hallucination severity in LLM academic outputs
    New metric created for this study with no independent evidence or prior validation mentioned.
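
Because the weights are a free parameter, a simple sensitivity check is to see whether the model ranking survives a change of weighting. The sketch below does this for a weighted, normalized rubric score. The model names, scores, weightings, and function names are all hypothetical placeholders, and the weighted-sum form is an assumption, since the paper's formula and weights are not stated.

```python
from typing import Dict, List, Sequence

DIMENSIONS = (
    "factual_accuracy",
    "reference_validity",
    "coherence",
    "style_consistency",
    "academic_tone",
)

Scores = Dict[str, float]


def weighted_index(scores: Scores, weights: Scores) -> float:
    """Weighted rubric score on the 0-5 scale, normalized to [0, 1]."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / (5.0 * total)


def ranking(models: Dict[str, Scores], weights: Scores) -> List[str]:
    """Model names ordered from highest to lowest index."""
    return sorted(
        models,
        key=lambda name: weighted_index(models[name], weights),
        reverse=True,
    )


def ranking_is_stable(
    models: Dict[str, Scores], weightings: Sequence[Scores]
) -> bool:
    """True if every weighting yields the same model ordering."""
    orderings = {tuple(ranking(models, w)) for w in weightings}
    return len(orderings) == 1


if __name__ == "__main__":
    # Made-up mean rubric scores for two placeholder models.
    models = {
        "model_a": dict(zip(DIMENSIONS, (5, 5, 2, 2, 2))),  # strong on facts
        "model_b": dict(zip(DIMENSIONS, (2, 2, 5, 5, 5))),  # strong on style
    }
    fact_heavy = dict(zip(DIMENSIONS, (0.35, 0.35, 0.10, 0.10, 0.10)))
    style_heavy = dict(zip(DIMENSIONS, (0.10, 0.10, 0.30, 0.30, 0.20)))
    print(ranking(models, fact_heavy))   # ['model_a', 'model_b']
    print(ranking(models, style_heavy))  # ['model_b', 'model_a']
    print(ranking_is_stable(models, [fact_heavy, style_heavy]))  # False
```

A ranking that flips under plausible alternative weightings, as in this contrived example, would indicate that reported differences owe as much to the metric's construction as to the models.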
