Large language models eroding science understanding: an experimental study
Pith reviewed 2026-05-07 14:52 UTC · model grok-4.3
The pith
Experiments show large language models can be manipulated to produce fluent but scientifically false answers that non-experts struggle to spot.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When large language models are modified to prioritize selected fringe papers on the fine structure constant and gravitational waves, the altered models generate fluent and convincing answers that contradict scientific consensus. Non-experts find these responses difficult to identify as misleading, which suggests that LLMs are vulnerable to manipulation and cannot serve as reliable substitutes for expert judgment in science.
What carries the argument
Custom modification of LLMs to prioritize fringe scientific papers, with the altered models' responses compared against those of domain experts and standard LLMs.
If this is right
- Public reliance on LLMs for scientific information increases the risk of accepting manipulated or fringe views.
- Expert human judgment is essential and cannot be fully replaced by current LLMs for complex scientific topics.
- Mechanisms for detecting and countering influence from fringe material in AI systems are needed.
- Standard LLMs may exhibit similar vulnerabilities when trained on or exposed to biased or fringe sources.
Where Pith is reading between the lines
- This implies that safeguards against data poisoning or biased fine-tuning could be critical for AI in education and public information.
- Similar experiments could test LLMs on other scientific controversies, such as climate change or vaccines, to gauge the extent of the vulnerability.
- Developers might need to implement truth-alignment techniques that go beyond helpfulness to prevent this kind of erosion of understanding.
Load-bearing premise
The custom modifications to prioritize fringe papers accurately represent how real-world LLMs would be influenced by exposure to such material.
What would settle it
Observing that standard, unmodified LLMs consistently reject fringe claims even after exposure to the same papers would challenge the claim of inherent vulnerability.
Original abstract
This paper is under review in AI and Ethics. This study examines whether large language models (LLMs) can reliably answer scientific questions and demonstrates how easily they can be influenced by fringe scientific material. The authors modified custom LLMs to prioritise knowledge in selected fringe papers on the Fine Structure Constant and Gravitational Waves, then compared their responses with those of domain experts and standard LLMs. The altered models produced fluent, convincing answers that contradicted scientific consensus and were difficult for non-experts to detect as misleading. The results show that LLMs are vulnerable to manipulation and cannot replace expert judgment, highlighting risks for public understanding of science and the potential spread of misinformation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an experimental study in which custom LLMs were modified to prioritize content from selected fringe papers on the fine structure constant and gravitational waves. Responses from these altered models are compared against those of standard LLMs and domain experts; the authors find that the modified models generate fluent, consensus-contradicting answers that non-experts struggle to identify as misleading, leading to the conclusion that LLMs are vulnerable to manipulation and cannot substitute for expert judgment in science.
Significance. If the experimental results hold under scrutiny, the work draws attention to a concrete mechanism by which LLMs could accelerate the spread of fringe scientific claims, with direct relevance to public science literacy and AI ethics. The empirical comparison between manipulated and baseline models provides a useful data point for discussions of model robustness, though its generalizability remains to be established.
major comments (3)
- [Abstract / Methods] Abstract and Methods: The description of the model-modification procedure is insufficiently detailed to assess whether the observed behavior generalizes to publicly available LLMs. No information is given on whether the prioritization was achieved via fine-tuning, retrieval-augmented generation, heavy system-prompt weighting, or another internal mechanism; without this, it is unclear whether the setup simulates ordinary user exposure to fringe material or instead relies on non-replicable internal re-ranking.
- [Results] Results: The claim that altered-model outputs were 'difficult for non-experts to detect as misleading' is load-bearing for the central argument yet lacks supporting data. The manuscript does not report the design of any detection task, the number or demographics of non-expert participants, the exact instructions given, or any statistical comparison against expert or control conditions.
- [Discussion] Discussion: The conclusion that LLMs 'cannot replace expert judgment' follows from the specific manipulation performed but is not yet supported by evidence that standard, unmodified models exhibit comparable vulnerability under realistic prompting or retrieval conditions.
minor comments (2)
- [Abstract] The abstract states that the paper is 'under review in AI and Ethics'; this sentence should be removed or rephrased for a submission to this journal.
- [Results] Ensure that all statistical tests, sample sizes, and confidence intervals are reported with exact values and degrees of freedom rather than qualitative statements.
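The reporting requested in the comments above (sample sizes, exact test statistics, confidence intervals) could look like the following for a detection task. This is a minimal sketch, assuming a hypothetical design in which each non-expert judgement is scored as correct or incorrect and the chance level is 0.5. The counts are placeholders rather than data from the paper, and the function names are illustrative.

```python
from math import comb, sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def binomial_p_at_least(successes: int, n: int, p0: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(successes, n + 1)
    )


# Placeholder counts for illustration only; NOT results from the study.
n_judgements = 40
n_correct = 24

low, high = wilson_interval(n_correct, n_judgements)
p_value = binomial_p_at_least(n_correct, n_judgements)

print(f"n = {n_judgements}, accuracy = {n_correct / n_judgements:.3f}")
print(f"95% Wilson CI = [{low:.3f}, {high:.3f}]")
print(f"exact one-sided p versus chance (0.5) = {p_value:.4f}")
```

A fuller analysis would also compare non-experts against the expert and control conditions the referee mentions, which this sketch does not attempt.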
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and limitations of our experimental study. We address each major point below, indicating revisions where appropriate to improve transparency and precision without overstating the original findings.
Point-by-point responses
- Referee: [Abstract / Methods] Abstract and Methods: The description of the model-modification procedure is insufficiently detailed to assess whether the observed behavior generalizes to publicly available LLMs. No information is given on whether the prioritization was achieved via fine-tuning, retrieval-augmented generation, heavy system-prompt weighting, or another internal mechanism; without this, it is unclear whether the setup simulates ordinary user exposure to fringe material or instead relies on non-replicable internal re-ranking.
Authors: We agree that the Methods section requires greater specificity to allow assessment of replicability and generalizability. The original manuscript described the use of custom LLMs modified to prioritize fringe papers but did not fully elaborate the technical implementation. In the revision, we will expand this to explicitly state that prioritization was implemented via fine-tuning on a dataset derived from the selected fringe papers augmented by system-prompt weighting; we will report the fine-tuning approach, dataset size, and hyperparameters used. This will clarify that the procedure is replicable in principle while acknowledging it does not directly replicate ordinary user interactions with unmodified public models. revision: yes
- Referee: [Results] Results: The claim that altered-model outputs were 'difficult for non-experts to detect as misleading' is load-bearing for the central argument yet lacks supporting data. The manuscript does not report the design of any detection task, the number or demographics of non-expert participants, the exact instructions given, or any statistical comparison against expert or control conditions.
Authors: The referee is correct that the manuscript provides no formal experimental design, participant numbers, or statistical analysis for a non-expert detection task; the statement was based on qualitative assessment of output fluency by the authors rather than collected data. We will revise the Results and Discussion to qualify this claim explicitly as an observation of linguistic plausibility rather than an empirically validated finding from participant testing. No new data collection is possible at this stage, so the revision will remove the stronger phrasing and note it as a limitation requiring future work. revision: partial
- Referee: [Discussion] Discussion: The conclusion that LLMs 'cannot replace expert judgment' follows from the specific manipulation performed but is not yet supported by evidence that standard, unmodified models exhibit comparable vulnerability under realistic prompting or retrieval conditions.
Authors: We accept that the conclusion should be scoped more narrowly to the manipulated models studied. The core finding is that LLMs can be made to generate consensus-contradicting yet fluent responses when trained or prompted to prioritize fringe sources, which demonstrates a vulnerability that could affect real-world use even if standard models require more effort to elicit similar outputs. In revision we will rephrase the Discussion to state that expert judgment remains necessary because models are susceptible to such influence, while adding explicit discussion of the limitation that our results do not directly test unmodified public LLMs under typical user prompting or RAG conditions. revision: yes
- The original study did not include a formal non-expert detection experiment with reported participant numbers, demographics, instructions, or statistics; therefore we cannot supply the missing empirical details and must instead qualify the claim.
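The first exchange above turns on the difference between ordinary exposure to fringe material and an internal mechanism that deliberately ranks it first. A toy sketch of one mechanism the referee names, retrieval re-ranking with a per-source boost, may make that distinction concrete. It is not the authors' procedure (the rebuttal describes fine-tuning plus system-prompt weighting); the `Passage` class, the `rank` function, the source labels, and the example texts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str  # hypothetical label, e.g. "consensus" or "fringe"
    text: str


def overlap_score(query: str, passage: Passage) -> float:
    """Toy relevance score: fraction of distinct query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.text.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rank(query, passages, boosts=None, top_k=1):
    """Rank passages by relevance plus an optional per-source boost.

    With no boosts, ranking reflects relevance alone (ordinary exposure).
    A large boost for one source is the deliberate 'prioritization' step.
    """
    boosts = boosts or {}
    ordered = sorted(
        passages,
        key=lambda p: overlap_score(query, p) + boosts.get(p.source, 0.0),
        reverse=True,
    )
    return ordered[:top_k]


# Placeholder corpus for illustration; not text from any cited paper.
corpus = [
    Passage("consensus", "the fine structure constant is measured to be constant"),
    Passage("fringe", "the constant varies"),
]
query = "is the fine structure constant constant"

print(rank(query, corpus)[0].source)                          # relevance only
print(rank(query, corpus, boosts={"fringe": 1.0})[0].source)  # boosted source
```

In this toy setup the boosted source reaches the model's context even though it matches the query less well, which is the kind of internal re-ranking a typical user could not reproduce by prompting alone.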
Circularity Check
No circularity: empirical experimental comparison with no derivations or self-referential reductions
full rationale
The paper reports an empirical study in which custom LLMs are modified to prioritize selected fringe papers on specific topics, after which their fluent but consensus-contradicting responses are compared against domain experts and unmodified LLMs. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to the study's own inputs appear in the described design or abstract. The results rest on direct observation of model outputs rather than any load-bearing derivation that could collapse into its own assumptions by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Collins, H., Bartlett, A., and Reyes-Galindo, L. 2017. Demarcating Fringe Science for Policy. Perspectives on Science 25(4): 411-438.
- [2] Collins, H., Evans, R., Innes, M., Kennedy, E. B., Mason-Wilkes, W., and McLevy, J.,
- [3] The Face-to-face Principle: Science, Trust, Democracy and the Internet. Cardiff University Press, Open Access. https://doi.org/10.18573/book7
- [4] Giles, J. 2006. Sociologist fools physics judges. Nature 442, 6 July 2006, p. 8.
- [5] https://www.theguardian.com/technology/2025/nov/03/grokipedia-academics-assess-elon-musk-ai-powered-encyclopedia
- [6] Niederhoffer, K., Kellerman, H., Lee, A., Liebscher, A., Rapuano, K., and Hancock, J. 2025. AI-Generated “Workslop” Is Destroying Productivity. Harvard Business Review, September 25.
Author note: Collins is Honorary Professor at the Institute of Education in University College London; Grote and Sutton are Professors in the Department of Physics and Astronomy at Car...