Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

Aqeel Khalique; Humam Khan; Md Tabrez Nafis; Rehan Hasan Khan; Shahab Saquib Sohail

arxiv: 2605.04171 · v1 · submitted 2026-05-05 · 💻 cs.CL

Pith reviewed 2026-05-08 17:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords hallucinations · large language models · academic writing · reference generation · factuality · Hallucination Index · ChatGPT · Grok

The pith

Hallucinations in large language models for academic writing vary by task and prompting conditions rather than model architecture alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four large language models on academic writing by feeding them 80 prompts split into reference generation, factual explanation, abstract generation, and writing improvement. It scores the outputs with a 0-5 rubric covering factual accuracy, reference validity, coherence, style consistency, and academic tone, then combines those scores into a new weighted Hallucination Index. The results show Grok and Copilot produce fewer hallucinations on reference tasks while Gemini and ChatGPT maintain better tone but introduce more factual errors. Overall the work establishes that hallucination rates shift with the kind of academic task and the exact prompt given, not only with which model is used.

Core claim

By testing ChatGPT, Grok, Gemini, and Copilot on four categories of academic prompts and scoring outputs with a new Hallucination Index, the study demonstrates that no single model is consistently superior; instead, performance and hallucination rates shift with the nature of the writing task and the prompt conditions provided.

What carries the argument

The Hallucination Index, a novel weighted metric derived from 0-5 rubric scores on factual accuracy, reference validity, coherence, style consistency, and academic tone.

If this is right

Grok and Copilot perform better on reference generation while Gemini and ChatGPT show stronger tone control.
Abstract and stylistic prompts increase hallucination risk across all four models.
Hallucination rates change with both task category and specific prompting conditions.
The Hallucination Index offers a quantitative tool for comparing model reliability on academic tasks.
Future work can explore prompt designs that reduce hallucinations in particular academic subtasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users may need to match specific models to the academic subtask rather than choosing one model for all writing needs.
Prompt engineering techniques could be developed to compensate for each model's observed weaknesses.
The same task-and-prompt dependence might appear in non-academic domains such as technical documentation or legal drafting.
Developers could incorporate task-type signals into model training to reduce hallucinations more effectively than architecture changes alone.

Load-bearing premise

The 0-5 rubric scores and the derived Hallucination Index accurately reflect true hallucinations without significant bias from the evaluators or the chosen prompts.

What would settle it

Re-scoring the same model outputs with independent academic experts or applying the rubric to outputs from a wider set of models and prompts to check whether the reported task-dependent HI patterns remain stable.
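
One way to make the re-scoring check concrete is to have a second rater score the same outputs and measure agreement. The sketch below computes unweighted Cohen's kappa in plain Python. It is an illustration only: the function name `cohens_kappa` and the two rating lists are hypothetical, and the paper does not report using this statistic. For an ordinal 0-5 rubric, a weighted kappa may be a better fit.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Share of items on which the two raters gave the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 0-5 factual-accuracy scores from two raters (not paper data).
rater_a = [5, 4, 4, 2, 3, 5, 1, 4]
rater_b = [5, 4, 3, 2, 3, 5, 2, 4]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

A kappa near 1 would suggest the rubric can be applied consistently; a low value would suggest the reported HI differences may partly reflect rater subjectivity.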

Figures

Figures reproduced from arXiv: 2605.04171 by Aqeel Khalique, Humam Khan, Md Tabrez Nafis, Rehan Hasan Khan, Shahab Saquib Sohail.

**Figure 3.** Heatmap: HI (%) by Task Category. Each cell gives F_d / R / HI%.

| Model | Reference Generation | Factual Explanation | Abstract Generation | Writing Improvement |
|---|---|---|---|---|
| Grok | 4.4 / 4.5 / 78.0 | 4.1 / 3.9 / 66.09 | 2.4 / 4.4 / 61.73 | 4.9 / 4.6 / 77.09 |
| Copilot | 4.3 / 4.2 / 77.5 | 4.7 / 5.0 / 88.37 | 3.5 / 2.8 / 60.73 | 1.8 / 3.3 / 57.73 |
| Gemini | 3.9 / 3.7 / 74.0 | 2.8 / 2.1 / 55.18 | 1.4 / 2.7 / 41.27 | 3.5 / 2.0 / … |

**Figure 4.** Frequency of Low-Quality Scores (Score < 3) per Model

Original abstract

Large Language models (LLMs) show extraordinary abilities, but they are still prone to hallucinations, especially when we use them for generating Academic content. We have investigated four popular LLMs, ChatGPT, Grok, Gemini, and Copilot for hallucinations specifically for academic writing. We have designed 80 prompts across four categories, namely, reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the model using a 0-5 rubric score, which checks factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, Hallucination Index (HI), was introduced to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to check errors which alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks, but they often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Whereas, Gemini and ChatGPT have done well with having stronger tone control, but they lack in writing factual tasks and higher hallucination risk with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior does not depend solely on model architecture but also on the type of task and the prompting conditions we are providing. We propose that our work opens new research dimensions for future researchers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper compares four LLMs on academic writing prompts and introduces a Hallucination Index, but the metric and rubric lack reported validation, so the task differences are hard to trust.

This paper compares hallucination rates in ChatGPT, Grok, Gemini, and Copilot when generating academic content. The authors created 80 prompts in four categories (reference generation, factual explanation, abstract generation, and writing improvement) and scored outputs on a 0-5 rubric for factual accuracy, reference validity, coherence, style, and tone. They then combined those scores into a weighted Hallucination Index.

The work applies this index to realistic academic-style tasks and reports that the models differ by task: Grok and Copilot score better on references but worse on abstracts and style, while Gemini and ChatGPT manage tone better but have more issues with facts. The claim is that hallucination depends on the task and prompting as much as on the model itself. The new part is the Hallucination Index tailored to academic writing and the breakdown by prompt category, which could help users pick a model for specific writing jobs.

The weak part is the lack of detail on how the scores were assigned. There is no mention of multiple evaluators or agreement rates, no test of whether the rubric actually catches hallucinations better than existing checks, and no clear justification for the weights in the index. The prompts were author-designed, so selection effects could explain some of the differences. Without those pieces, the reported HI values (around 0.5 to 0.7) are hard to interpret as solid evidence for the task-dependence conclusion.

Readers deciding which LLM to use for drafting papers or checking references might find the numbers interesting as a starting point. The paper is not deep enough for someone building new evaluation methods or running large-scale studies. I think it deserves a serious referee if the authors expand the methods section with validation steps for the rubric and index. The core idea is reasonable, but the current version leaves too much of the evaluation process unaddressed.

Referee Report

3 major / 1 minor

Summary. The paper investigates hallucinations in four LLMs (ChatGPT, Grok, Gemini, Copilot) for academic writing tasks. It uses 80 author-designed prompts across four categories (reference generation, factual explanation, abstract generation, writing improvement), evaluates outputs via a 0-5 rubric on factual accuracy, reference validity, coherence, style, and tone, and introduces a novel weighted Hallucination Index (HI) to quantify hallucinations. Results show model-specific patterns (e.g., Grok and Copilot, with HI values of 0.67 and 0.70, do better on reference tasks but struggle with abstract or stylistic prompts; Gemini and ChatGPT, with HI values of 0.53 and 0.57, show better tone control but factual weaknesses), leading to the claim that hallucination depends on task type and prompting conditions, not solely on model architecture.

Significance. If the rubric and HI are shown to be reliable and bias-free, the work would be significant for NLP and AI-assisted scholarship by demonstrating that hallucination rates are task- and prompt-dependent. This could shift focus from model selection alone to prompt engineering and task-specific safeguards in academic applications, opening research on mitigation strategies tailored to writing subtasks.

major comments (3)

[Abstract/Methods] The 0-5 rubric and its aggregation into the weighted Hallucination Index are central to all reported HI values and the task-dependence claim, yet no details are provided on rubric application criteria, number of evaluators, inter-rater agreement statistics, or how disagreements were resolved. Without these, observed HI spreads (e.g., 0.67 vs. 0.53) cannot be distinguished from evaluator subjectivity.
[Abstract/Results] The weights used in the novel Hallucination Index are unspecified (free parameters), with no ablation, correlation to established hallucination metrics (e.g., factuality benchmarks), or external validation against human fact-checking. This directly undermines attribution of HI differences to task/prompt factors rather than metric construction.
[Results] No statistical tests, confidence intervals, or analysis of prompt sampling/balancing are reported for the 80 prompts or HI comparisons across the four categories. This leaves open whether differences reflect genuine task dependence or selection bias in the author-designed prompts.
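
As a rough illustration of the kind of interval the third comment asks for, the sketch below computes a percentile bootstrap confidence interval for the difference in mean HI between two task categories. Everything here is assumed for illustration: the function name `bootstrap_mean_diff_ci`, the per-prompt HI values, and the choice to treat prompts in each category as independent samples. None of it comes from the paper.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-prompt HI values for one model (not paper data).
reference_hi = [0.78, 0.81, 0.74, 0.80, 0.77, 0.83]
abstract_hi = [0.62, 0.58, 0.65, 0.60, 0.61, 0.57]
low, high = bootstrap_mean_diff_ci(reference_hi, abstract_hi)
print(f"95% CI for the mean HI difference: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would lend some support to a task effect for that model, although it would not address selection bias in how the prompts were written.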

minor comments (1)

[Abstract] The sentence on evaluation metrics failing for sentiment-altering errors in machine-translated text is out of place and unrelated to the LLM academic-writing focus; it should be removed or replaced with a relevant limitation statement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment point by point below, indicating where revisions will be made to improve transparency and rigor.

Referee: [Abstract/Methods] The 0-5 rubric and its aggregation into the weighted Hallucination Index are central to all reported HI values and the task-dependence claim, yet no details are provided on rubric application criteria, number of evaluators, inter-rater agreement statistics, or how disagreements were resolved. Without these, observed HI spreads (e.g., 0.67 vs. 0.53) cannot be distinguished from evaluator subjectivity.

Authors: We acknowledge the need for greater transparency in the evaluation process. The 0-5 rubric was defined with explicit criteria for each dimension (e.g., factual accuracy scored 0 for major fabrications to 5 for fully verifiable content; reference validity from invalid citations to fully accurate). All 80 outputs were scored by the lead author for consistency. We will revise the Methods section to include the full rubric table with scoring anchors and explicitly state the single-evaluator design. This makes the process reproducible and allows readers to evaluate subjectivity.
revision: yes

Referee: [Abstract/Results] The weights used in the novel Hallucination Index are unspecified (free parameters), with no ablation, correlation to established hallucination metrics (e.g., factuality benchmarks), or external validation against human fact-checking. This directly undermines attribution of HI differences to task/prompt factors rather than metric construction.

Authors: The HI is a weighted sum of the five rubric dimensions, with weights reflecting their relative priority in academic writing (factual accuracy and reference validity receiving higher emphasis). We will add the exact weights and their justification to the revised manuscript. We did not perform ablation or external validation in the original study, as the focus was on introducing the index for this task set; we will note this as a limitation and suggest it for future work rather than claiming robustness beyond the current scope.
revision: partial

Referee: [Results] No statistical tests, confidence intervals, or analysis of prompt sampling/balancing are reported for the 80 prompts or HI comparisons across the four categories. This leaves open whether differences reflect genuine task dependence or selection bias in the author-designed prompts.

Authors: We agree this is a limitation of the current presentation. The 80 prompts were manually designed with 20 per category to represent common academic writing scenarios. In the revision we will report per-category means and standard deviations for HI scores and add a note on the author-designed, non-random prompt selection. Formal hypothesis testing is not feasible without a larger, independently sampled prompt corpus, but the consistent model-by-task patterns across the four categories still support our preliminary claim of task dependence.
revision: yes
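
The per-category summary promised in this response could be produced along the following lines. This is a minimal sketch under stated assumptions: the record layout `(model, category, hi)`, the function name `summarize_by_category`, and the sample values are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_category(records):
    """Return {(model, category): (mean HI, sample SD, n)}."""
    groups = defaultdict(list)
    for model, category, hi in records:
        groups[(model, category)].append(hi)
    return {
        # A sample SD needs at least two values; report 0.0 for singletons.
        key: (mean(values), stdev(values) if len(values) > 1 else 0.0, len(values))
        for key, values in groups.items()
    }


# Hypothetical per-prompt HI values (not paper data).
records = [
    ("Grok", "reference generation", 0.7),
    ("Grok", "reference generation", 0.8),
    ("Grok", "reference generation", 0.9),
    ("Gemini", "abstract generation", 0.4),
]
for (model, category), (avg, sd, n) in summarize_by_category(records).items():
    print(f"{model:8s} {category:22s} mean={avg:.2f} sd={sd:.2f} n={n}")
```

Reporting the spread next to each mean lets a reader judge whether a gap between two categories is large relative to the prompt-to-prompt variation.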

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivation chain

The paper performs an empirical study: 80 author-designed prompts in four categories are fed to four LLMs, outputs are scored on a 0-5 rubric across five dimensions, and a weighted Hallucination Index is computed from those scores. No equations, first-principles derivations, fitted parameters, or uniqueness theorems are claimed. The central finding (hallucination rates vary by task and prompt type) is a direct summary of the observed score differences; it does not reduce to any self-referential definition or input by construction. The novel HI metric is introduced without any reported fitting or self-citation that would make later results tautological. This is a standard empirical measurement study whose conclusions stand or fall on the quality of the rubric and prompt set, not on any circular logical step.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the assumption that the custom rubric and weighted HI metric validly quantify hallucinations; the metric itself is introduced without external benchmarks or validation data.

free parameters (1)

Weights for Hallucination Index
HI is described as a novel weighted metric combining rubric dimensions; specific weights are not stated but must be chosen or fitted to produce the reported scores.

invented entities (1)

Hallucination Index (HI) · no independent evidence
purpose: Composite score to quantify hallucination severity in LLM academic outputs
New metric created for this study with no independent evidence or prior validation mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation washburn_uniqueness_aczel unclear

A novel weighted metric, Hallucination Index (HI), was introduced to measure hallucination... we gave the highest weight to factual accuracy (4) and reference validity (3)... coherence a medium weight (2)... Style consistency and academic tone were given the lowest weight (1 each)
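
The weighting quoted above can be turned into a small calculation. The sketch below is a hedged illustration, not the authors' implementation: the weights (4, 3, 2, 1, 1) come from the quoted passage, while the normalization (dividing by the maximum attainable weighted score so that HI falls in [0, 1]) and the names `WEIGHTS` and `hallucination_index` are assumptions made here.

```python
# Weights as quoted from the paper passage above.
WEIGHTS = {
    "factual_accuracy": 4,
    "reference_validity": 3,
    "coherence": 2,
    "style_consistency": 1,
    "academic_tone": 1,
}
MAX_SCORE = 5  # top of the 0-5 rubric


def hallucination_index(scores: dict[str, float]) -> float:
    """Weighted rubric score scaled to [0, 1].

    ASSUMPTION: normalization by the maximum attainable weighted score;
    the source does not spell out the scaling.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= MAX_SCORE:
            raise ValueError(f"{name} score {value} is outside 0-{MAX_SCORE}")
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted / (MAX_SCORE * sum(WEIGHTS.values()))


# Hypothetical scores for one output (not data from the paper).
example = {
    "factual_accuracy": 4,
    "reference_validity": 3,
    "coherence": 5,
    "style_consistency": 4,
    "academic_tone": 5,
}
print(round(hallucination_index(example), 3))
```

Under this scaling a higher value indicates a better-scored output, which matches the abstract's pairing of lower HI values with higher hallucination risk.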

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] C. Jiang et al., “Hallucination augmented contrastive learning for multimodal large language model,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024.
[2] X. Guan et al., “Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting,” Proc. AAAI Conf. Artif. Intell., vol. 38, no. 16, 2024.
[3] P. Sui et al., “Confabulation: The surprising value of large language model hallucinations,” in Proc. 62nd Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), 2024.
[4] P. R. Vishwanath et al., “Faithfulness hallucination detection in healthcare AI,” in Artif. Intell. Data Sci. Healthcare: Bridging Data-Centric AI People-Centric Healthcare, 2024.
[5] X. Jing, S. Billa, and D. Godbout, “On a scale from 1 to 5: Quantifying hallucination in faithfulness evaluation,” Findings Assoc. Comput. Linguistics: NAACL, 2025.
[6] M. Cao, Y. Dong, and J. C. K. Cheung, “Hallucinated but factual! Inspecting the factuality of hallucinations in abstractive summarization,” in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), 2022.
[7] J. E. Chukwuere, “Today's academic research: The role of ChatGPT writing,” J. Inf. Syst. Informat., vol. 6, no. 1, pp. 30–46, 2024.
[8] M. Zohery, “ChatGPT in academic writing and publishing: A comprehensive guide,” Artif. Intell. Academia, Res. Sci.: ChatGPT Case Study, vol. 1, no. 5, 2023.
[9] G. F. Lendvai, “ChatGPT in academic writing: A scientometric analysis of literature published between 2022 and 2023,” J. Empirical Res. Human Res. Ethics, p. 15562646251350203, 2025.
[10] A. M. Jarrah, Y. Wardat, and P. Fidalgo, “Using ChatGPT in academic writing is (not) a form of plagiarism: What does the literature say,” Online J. Commun. Media Technol., vol. 13, no. 4, p. e202346, 2023.
[11] E. Safran and A. Çalı, “Fabricated or accurate? Ethical concerns and citation hallucination in AI-generated scientific writing on musculoskeletal topics,” Anatolian Current Med. J., vol. 7, no. 5, pp. 695–702, 2025.
[12] P. Pattnayak et al., “Review of reference generation methods in large language models,” J. ID 9339, p. 1263, 2024. (Note: Journal details appear incomplete in source; formatted based on available data.)
[13] B. Edelman and J. Skolnick, “Valsci: An open-source, self-hostable literature review utility for automated large-batch scientific claim verification using large language models,” BMC Bioinf., vol. 26, no. 1, p. 140, 2025.
[14] H. Pratama, “Training students to identify and correct fabricated references in ChatGPT-generated literature reviews,” in Proc. Conf. English Lang. Teaching, 2025.
[15] E. J. Gantana et al., “An in-depth analysis of AI-generated scientific review articles and its potential implications on the future of medical journal publications,” Next Res., p. 101002, 2025.
[16] K. Sharun et al., “ChatGPT and artificial hallucinations in stem cell research: Assessing the accuracy of generated references—A preliminary study,” Ann. Med. Surgery, vol. 85, no. 10, pp. 5275–5278, 2023.
[17] D. P. Acut et al., “‘ChatGPT 4.0 ghosted us while conducting literature search:’ Modeling the chatbot’s generated non-existent references using regression analysis,” Internet Ref. Services Quart., vol. 29, no. 1, pp. 27–54, 2025.
[18] F. Farhat et al., “How trustworthy is ChatGPT? The case of bibliometric analyses,” Cogent Engineering, vol. 10, no. 1, 2023.
[19] H. S. AlSagri et al., “ChatGPT or Gemini: Who Makes the Better Scientific Writing Assistant?,” Journal of Academic Ethics, vol. 23, no. 3, 2025.
[20] S. S. Sohail et al., “Decoding ChatGPT: A taxonomy of existing research, current challenges, and possible future directions,” Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 8, 2023.
[21] S. Tripathi, M. T. Nafis, I. Hussain, and A. K. J. Saudagar, “Multimodal fine-tuning of LLMs for robust document visual question answering,” IEEE Access, vol. 13, pp. 174611–174623, 2025, doi: 10.1109/ACCESS.2025.3615201.
[22] H. Saadany and C. Orăsan, “BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-Oriented Text,” in Proceedings of the Translation and Interpreting Technology Online Conference (TRITON), INCOMA Ltd., 2021, pp. 48–56.
