Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing
Pith reviewed 2026-05-08 17:50 UTC · model grok-4.3
The pith
Hallucinations in large language models for academic writing vary by task and prompting conditions rather than model architecture alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By testing ChatGPT, Grok, Gemini, and Copilot on four categories of academic prompts and scoring outputs with a new Hallucination Index, the study demonstrates that no single model is consistently superior; instead, performance and hallucination rates shift with the nature of the writing task and the prompt conditions provided.
What carries the argument
The Hallucination Index, a novel weighted metric derived from 0-5 rubric scores on factual accuracy, reference validity, coherence, style consistency, and academic tone.
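As a rough illustration of how such an index could be computed, the sketch below combines the five rubric scores using the weights quoted from the paper further down this page (factual accuracy 4, reference validity 3, coherence 2, style consistency 1, academic tone 1). The page does not state the formula, so the normalization and the direction of the scale (higher meaning more hallucination) are assumptions, and `hallucination_index` is a hypothetical name.

```python
# Weights as quoted from the paper; the formula itself is an assumption.
WEIGHTS = {
    "factual_accuracy": 4,
    "reference_validity": 3,
    "coherence": 2,
    "style_consistency": 1,
    "academic_tone": 1,
}
MAX_SCORE = 5  # top of the 0-5 rubric


def hallucination_index(scores):
    """Return an assumed HI in [0, 1]: 0.0 for all-5 scores, 1.0 for all-0 scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= MAX_SCORE:
            raise ValueError(f"{name} must be between 0 and {MAX_SCORE}")
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    best_possible = MAX_SCORE * sum(WEIGHTS.values())
    # Assumed direction: a lower weighted rubric score gives a higher index.
    return 1.0 - weighted / best_possible


# Made-up scores for one output, not data from the paper.
example = {
    "factual_accuracy": 2,
    "reference_validity": 1,
    "coherence": 4,
    "style_consistency": 5,
    "academic_tone": 5,
}
print(round(hallucination_index(example), 3))
```

Under this assumed form, fluent but poorly sourced text still lands near the middle of the scale, because the two heavily weighted dimensions dominate the sum.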
If this is right
- Grok and Copilot perform better on reference generation while Gemini and ChatGPT show stronger tone control.
- Abstract and stylistic prompts are a weak spot for Grok and Copilot, with reported HI values of 0.67 and 0.70.
- Hallucination rates change with both task category and specific prompting conditions.
- The Hallucination Index offers a quantitative tool for comparing model reliability on academic tasks.
- Future work can explore prompt designs that reduce hallucinations in particular academic subtasks.
Where Pith is reading between the lines
- Users may need to match specific models to the academic subtask rather than choosing one model for all writing needs.
- Prompt engineering techniques could be developed to compensate for each model's observed weaknesses.
- The same task-and-prompt dependence might appear in non-academic domains such as technical documentation or legal drafting.
- Developers could incorporate task-type signals into model training to reduce hallucinations more effectively than architecture changes alone.
Load-bearing premise
The 0-5 rubric scores and the derived Hallucination Index accurately reflect true hallucinations without significant bias from the evaluators or the chosen prompts.
What would settle it
Re-scoring the same model outputs with independent academic experts or applying the rubric to outputs from a wider set of models and prompts to check whether the reported task-dependent HI patterns remain stable.
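If a second rater re-scored the same outputs, agreement between the two sets of rubric scores could be summarized with Cohen's kappa. The sketch below is a minimal, unweighted version written from the standard definition; `cohens_kappa` is a hypothetical helper and the scores are made-up placeholders. Because the rubric is ordinal, a weighted kappa may be the better choice in practice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance from each rater's marginal score frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical score throughout
    return (observed - expected) / (1.0 - expected)


# Made-up 0-5 factual-accuracy scores from two hypothetical raters.
lead_author = [5, 4, 3, 5, 2, 0]
second_rater = [5, 4, 2, 5, 2, 1]
print(round(cohens_kappa(lead_author, second_rater), 3))
```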
Original abstract
Large Language models (LLMs) show extraordinary abilities, but they are still prone to hallucinations, especially when we use them for generating Academic content. We have investigated four popular LLMs, ChatGPT, Grok, Gemini, and Copilot for hallucinations specifically for academic writing. We have designed 80 prompts across four categories, namely, reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the model using a 0-5 rubric score, which checks factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, Hallucination Index (HI), was introduced to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to check errors which alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks, but they often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Whereas, Gemini and ChatGPT have done well with having stronger tone control, but they lack in writing factual tasks and higher hallucination risk with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior does not depend solely on model architecture but also on the type of task and the prompting conditions we are providing. We propose that our work opens new research dimensions for future researchers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates hallucinations in four LLMs (ChatGPT, Grok, Gemini, Copilot) for academic writing tasks. It uses 80 author-designed prompts across four categories (reference generation, factual explanation, abstract generation, writing improvement), evaluates outputs via a 0-5 rubric on factual accuracy, reference validity, coherence, style consistency, and academic tone, and introduces a novel weighted Hallucination Index (HI) to quantify hallucinations. Results show model-specific patterns: Grok and Copilot do better on reference generation but struggle with abstract or stylistic prompts (reported HI 0.67 and 0.70), while Gemini and ChatGPT show better tone control but factual weaknesses (reported HI 0.53 and 0.57). From this the paper claims that hallucination depends on task type and prompting conditions, not solely on model architecture.
Significance. If the rubric and HI are shown to be reliable and bias-free, the work would be significant for NLP and AI-assisted scholarship by demonstrating that hallucination rates are task- and prompt-dependent. This could shift focus from model selection alone to prompt engineering and task-specific safeguards in academic applications, opening research on mitigation strategies tailored to writing subtasks.
major comments (3)
- [Abstract/Methods] The 0-5 rubric and its aggregation into the weighted Hallucination Index are central to all reported HI values and the task-dependence claim, yet no details are provided on rubric application criteria, number of evaluators, inter-rater agreement statistics, or how disagreements were resolved. Without these, observed HI spreads (e.g., 0.67 vs. 0.53) cannot be distinguished from evaluator subjectivity.
- [Abstract/Results] The weights used in the novel Hallucination Index are unspecified (free parameters), with no ablation, correlation to established hallucination metrics (e.g., factuality benchmarks), or external validation against human fact-checking. This directly undermines attribution of HI differences to task/prompt factors rather than metric construction.
- [Results] No statistical tests, confidence intervals, or analysis of prompt sampling/balancing are reported for the 80 prompts or HI comparisons across the four categories. This leaves open whether differences reflect genuine task dependence or selection bias in the author-designed prompts.
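One way to attach uncertainty to a per-category mean HI, in the spirit of the third comment, is a percentile bootstrap. The sketch below is illustrative only: `bootstrap_ci` is a hypothetical helper and the HI values are made-up placeholders, not data from the paper. Such an interval reflects resampling noise among the prompts that were scored; it does not correct for selection bias in how the prompts were designed.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up HI values for 20 prompts in one category (placeholders only).
category_hi = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.52, 0.59, 0.73, 0.45,
               0.64, 0.58, 0.50, 0.68, 0.62, 0.47, 0.71, 0.54, 0.60, 0.57]
low, high = bootstrap_ci(category_hi)
print(f"mean={mean(category_hi):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Overlapping intervals for two categories would suggest that a reported HI gap may not be distinguishable from noise at this sample size.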
minor comments (1)
- [Abstract] The sentence on evaluation metrics failing for sentiment-altering errors in machine-translated text is out of place and unrelated to the LLM academic-writing focus; it should be removed or replaced with a relevant limitation statement.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment point by point below, indicating where revisions will be made to improve transparency and rigor.
- Referee: [Abstract/Methods] The 0-5 rubric and its aggregation into the weighted Hallucination Index are central to all reported HI values and the task-dependence claim, yet no details are provided on rubric application criteria, number of evaluators, inter-rater agreement statistics, or how disagreements were resolved. Without these, observed HI spreads (e.g., 0.67 vs. 0.53) cannot be distinguished from evaluator subjectivity.
  Authors: We acknowledge the need for greater transparency in the evaluation process. The 0-5 rubric was defined with explicit criteria for each dimension (e.g., factual accuracy scored from 0 for major fabrications to 5 for fully verifiable content; reference validity from invalid citations to fully accurate ones). The outputs for all 80 prompts were scored by the lead author for consistency. We will revise the Methods section to include the full rubric table with scoring anchors and explicitly state the single-evaluator design. This makes the process reproducible and allows readers to judge the degree of subjectivity for themselves.
  revision: yes
- Referee: [Abstract/Results] The weights used in the novel Hallucination Index are unspecified (free parameters), with no ablation, correlation to established hallucination metrics (e.g., factuality benchmarks), or external validation against human fact-checking. This directly undermines attribution of HI differences to task/prompt factors rather than metric construction.
  Authors: The HI is a weighted sum of the five rubric dimensions, with weights reflecting their relative priority in academic writing (factual accuracy and reference validity receiving higher emphasis). We will add the exact weights and their justification to the revised manuscript. We did not perform ablation or external validation in the original study, as the focus was on introducing the index for this task set; we will note this as a limitation and suggest it for future work rather than claiming robustness beyond the current scope.
  revision: partial
- Referee: [Results] No statistical tests, confidence intervals, or analysis of prompt sampling/balancing are reported for the 80 prompts or HI comparisons across the four categories. This leaves open whether differences reflect genuine task dependence or selection bias in the author-designed prompts.
  Authors: We agree this is a limitation of the current presentation. The 80 prompts were manually designed with 20 per category to represent common academic writing scenarios. In the revision we will report per-category means and standard deviations for HI scores and add a note on the author-designed, non-random prompt selection. Formal hypothesis testing is not feasible without a larger, independently sampled prompt corpus, but the consistent model-by-task patterns across the four categories still support our preliminary claim of task dependence.
  revision: yes
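The per-category means and standard deviations promised in the last response could be tabulated along these lines. This is a sketch under assumptions: the record layout (model, category, HI) and the name `summarize_by_model_and_category` are hypothetical, and the values are placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_model_and_category(records):
    """Map (model, category) to (mean HI, sample standard deviation)."""
    grouped = defaultdict(list)
    for model, category, hi in records:
        grouped[(model, category)].append(hi)
    summary = {}
    for key, values in grouped.items():
        # stdev needs at least two points; report 0.0 for a single observation.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[key] = (mean(values), spread)
    return summary


# Placeholder records, not data from the paper.
records = [
    ("Grok", "reference_generation", 0.2),
    ("Grok", "reference_generation", 0.4),
    ("Grok", "abstract_generation", 0.6),
]
for (model, category), (avg, sd) in summarize_by_model_and_category(records).items():
    print(f"{model:8s} {category:22s} mean={avg:.2f} sd={sd:.2f}")
```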
Circularity Check
No circularity: purely empirical evaluation with no derivation chain
full rationale
The paper performs an empirical study: 80 author-designed prompts in four categories are fed to four LLMs, outputs are scored on a 0-5 rubric across five dimensions, and a weighted Hallucination Index is computed from those scores. No equations, first-principles derivations, fitted parameters, or uniqueness theorems are claimed. The central finding (hallucination rates vary by task and prompt type) is a direct summary of the observed score differences; it does not reduce to any self-referential definition or input by construction. The novel HI metric is introduced without any reported fitting or self-citation that would make later results tautological. This is a standard empirical measurement study whose conclusions stand or fall on the quality of the rubric and prompt set, not on any circular logical step.
Axiom & Free-Parameter Ledger
free parameters (1)
- Weights for Hallucination Index
invented entities (1)
- Hallucination Index (HI): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
A novel weighted metric, Hallucination Index (HI), was introduced to measure hallucination... we gave the highest weight to factual accuracy (4) and reference validity (3)... coherence a medium weight (2)... Style consistency and academic tone were given the lowest weight (1 each)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] C. Jiang et al., “Hallucination augmented contrastive learning for multimodal large language model,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024
- [2] X. Guan et al., “Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting,” Proc. AAAI Conf. Artif. Intell., vol. 38, no. 16, 2024
- [3] P. Sui et al., “Confabulation: The surprising value of large language model hallucinations,” in Proc. 62nd Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), 2024
- [4] P. R. Vishwanath et al., “Faithfulness hallucination detection in healthcare AI,” in Artif. Intell. Data Sci. Healthcare: Bridging Data-Centric AI People-Centric Healthcare, 2024
- [5] X. Jing, S. Billa, and D. Godbout, “On a scale from 1 to 5: Quantifying hallucination in faithfulness evaluation,” Findings Assoc. Comput. Linguistics: NAACL, 2025
- [6] M. Cao, Y. Dong, and J. C. K. Cheung, “Hallucinated but factual! Inspecting the factuality of hallucinations in abstractive summarization,” in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), 2022
- [7] J. E. Chukwuere, “Today's academic research: The role of ChatGPT writing,” J. Inf. Syst. Informat., vol. 6, no. 1, pp. 30–46, 2024
- [8] M. Zohery, “ChatGPT in academic writing and publishing: A comprehensive guide,” Artif. Intell. Academia, Res. Sci.: ChatGPT Case Study, vol. 1, no. 5, 2023
- [9] G. F. Lendvai, “ChatGPT in academic writing: A scientometric analysis of literature published between 2022 and 2023,” J. Empirical Res. Human Res. Ethics, p. 15562646251350203, 2025
- [10] A. M. Jarrah, Y. Wardat, and P. Fidalgo, “Using ChatGPT in academic writing is (not) a form of plagiarism: What does the literature say,” Online J. Commun. Media Technol., vol. 13, no. 4, p. e202346, 2023
- [11] E. Safran and A. Çalı, “Fabricated or accurate? Ethical concerns and citation hallucination in AI-generated scientific writing on musculoskeletal topics,” Anatolian Current Med. J., vol. 7, no. 5, pp. 695–702, 2025
- [12] P. Pattnayak et al., “Review of reference generation methods in large language models,” J. ID 9339, p. 1263, 2024. (Note: Journal details appear incomplete in source; formatted based on available data.)
- [13] B. Edelman and J. Skolnick, “Valsci: An open-source, self-hostable literature review utility for automated large-batch scientific claim verification using large language models,” BMC Bioinf., vol. 26, no. 1, p. 140, 2025
- [14] H. Pratama, “Training students to identify and correct fabricated references in ChatGPT-generated literature reviews,” in Proc. Conf. English Lang. Teaching, 2025
- [15] E. J. Gantana et al., “An in-depth analysis of AI-generated scientific review articles and its potential implications on the future of medical journal publications,” Next Res., p. 101002, 2025
- [16] K. Sharun et al., “ChatGPT and artificial hallucinations in stem cell research: Assessing the accuracy of generated references – A preliminary study,” Ann. Med. Surgery, vol. 85, no. 10, pp. 5275–5278, 2023
- [17] D. P. Acut et al., “‘ChatGPT 4.0 ghosted us while conducting literature search:’ Modeling the chatbot’s generated non-existent references using regression analysis,” Internet Ref. Services Quart., vol. 29, no. 1, pp. 27–54, 2025
- [18] F. Farhat et al., “How trustworthy is ChatGPT? The case of bibliometric analyses,” Cogent Engineering, vol. 10, no. 1, 2023
- [19] H. S. AlSagri et al., “ChatGPT or Gemini: Who Makes the Better Scientific Writing Assistant?,” Journal of Academic Ethics, vol. 23, no. 3, 2025
- [20] S. S. Sohail et al., “Decoding ChatGPT: A taxonomy of existing research, current challenges, and possible future directions,” Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 8, 2023
- [21] S. Tripathi, M. T. Nafis, I. Hussain, and A. K. J. Saudagar, “Multimodal fine-tuning of LLMs for robust document visual question answering,” IEEE Access, vol. 13, pp. 174611–174623, 2025, doi: 10.1109/ACCESS.2025.3615201
- [22] H. Saadany and C. Orăsan, “BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-Oriented Text,” in Proceedings of the Translation and Interpreting Technology Online Conference (TRITON), INCOMA Ltd., 2021, pp. 48–56