Large language models eroding science understanding: an experimental study

Harry Collins; Hartmut Grote; Patrick Sutton; Paul Newbury; Simon Thorne

arxiv: 2604.25639 · v1 · submitted 2026-04-28 · 💻 cs.CY · cs.AI

Pith reviewed 2026-05-07 14:52 UTC · model grok-4.3

keywords: large language models, scientific consensus, fringe science, misinformation, fine structure constant, gravitational waves, AI manipulation, expert judgment

The pith

Experiments show large language models can be manipulated to produce fluent but scientifically false answers that non-experts struggle to spot.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can be trusted for scientific questions by modifying them to favor fringe ideas. The authors created custom models that prioritize papers on the fine structure constant and gravitational waves that go against mainstream science. These models gave fluent responses that experts would reject but non-experts might accept as true. The finding matters because it shows LLMs could undermine public understanding of science if used without expert oversight, as they are susceptible to manipulation through their training or prompting data.

Core claim

By modifying large language models to prioritize selected fringe papers on the fine structure constant and gravitational waves, the altered models generate fluent and convincing answers that contradict scientific consensus. These responses are difficult for non-experts to identify as misleading, demonstrating that LLMs are vulnerable to manipulation and cannot serve as reliable substitutes for expert judgment in science.

What carries the argument

The custom modification of LLMs to prioritize fringe scientific papers, compared against domain experts and standard LLMs.

If this is right

Public reliance on LLMs for scientific information increases the risk of accepting manipulated or fringe views.
Expert human judgment is essential and cannot be fully replaced by current LLMs for complex scientific topics.
Mechanisms for detecting and countering influence from fringe material in AI systems are needed.
Standard LLMs may exhibit similar vulnerabilities when trained on or exposed to biased or fringe sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This implies that safeguards against data poisoning or biased fine-tuning could be critical for AI in education and public information.
Similar experiments could test LLMs on other scientific controversies like climate change or vaccines to see the extent of vulnerability.
Developers might need to implement truth-alignment techniques that go beyond helpfulness to prevent this erosion of understanding.

Load-bearing premise

The custom modifications to prioritize fringe papers accurately represent how real-world LLMs would be influenced by exposure to such material.

What would settle it

Observing that standard, unmodified LLMs consistently reject fringe claims even after exposure to the same papers would challenge the claim of inherent vulnerability.

Original abstract

This paper is under review in AI and Ethics. This study examines whether large language models (LLMs) can reliably answer scientific questions and demonstrates how easily they can be influenced by fringe scientific material. The authors modified custom LLMs to prioritise knowledge in selected fringe papers on the Fine Structure Constant and Gravitational Waves, then compared their responses with those of domain experts and standard LLMs. The altered models produced fluent, convincing answers that contradicted scientific consensus and were difficult for non-experts to detect as misleading. The results show that LLMs are vulnerable to manipulation and cannot replace expert judgment, highlighting risks for public understanding of science and the potential spread of misinformation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLMs can be tweaked to output fluent fringe-science answers on topics like the fine structure constant, but the tweak method may not match ordinary user interactions.

The main point is that the authors altered some LLMs to favor selected fringe papers on the fine structure constant and gravitational waves, after which the models gave coherent but consensus-violating responses that non-experts found hard to flag. That concrete demonstration on real physics questions is the useful part. It moves past general warnings about LLM unreliability and shows a specific way the outputs can mislead on established science. The side-by-side comparison with domain experts also helps make the risk visible rather than abstract.

The soft spot sits in the modification itself. The abstract describes prioritizing fringe material inside custom models, yet gives no detail on whether this was prompt-level, retrieval-based, fine-tuning, or a deeper internal change. If the method requires special access or heavy reweighting that normal users cannot do, the results do not yet show how readily public LLMs would behave the same way under ordinary prompting or web retrieval. The claim that non-experts struggle to detect the problems likewise needs the actual test protocol, participant numbers, and question set before it carries weight.

This work is aimed at people tracking AI in science communication and misinformation policy. It is coherent enough on its own terms to deserve referee time, even though the current version is thin on methods. I would send it for review with the expectation that the authors clarify the alteration process and supply the missing experimental details.

Referee Report

3 major / 2 minor

Summary. The paper reports an experimental study in which custom LLMs were modified to prioritize content from selected fringe papers on the Fine Structure Constant and gravitational waves. Responses from these altered models are compared against those of standard LLMs and domain experts; the authors find that the modified models generate fluent, consensus-contradicting answers that non-experts struggle to identify as misleading, leading to the conclusion that LLMs are vulnerable to manipulation and cannot substitute for expert judgment in science.

Significance. If the experimental results hold under scrutiny, the work draws attention to a concrete mechanism by which LLMs could accelerate the spread of fringe scientific claims, with direct relevance to public science literacy and AI ethics. The empirical comparison between manipulated and baseline models provides a useful data point for discussions of model robustness, though its generalizability remains to be established.

major comments (3)

[Abstract / Methods] The description of the model-modification procedure is insufficiently detailed to assess whether the observed behavior generalizes to publicly available LLMs. No information is given on whether the prioritization was achieved via fine-tuning, retrieval-augmented generation, heavy system-prompt weighting, or another internal mechanism; without this, it is unclear whether the setup simulates ordinary user exposure to fringe material or instead relies on non-replicable internal re-ranking.
[Results] The claim that altered-model outputs were 'difficult for non-experts to detect as misleading' is load-bearing for the central argument yet lacks supporting data. The manuscript does not report the design of any detection task, the number or demographics of non-expert participants, the exact instructions given, or any statistical comparison against expert or control conditions.
[Discussion] The conclusion that LLMs 'cannot replace expert judgment' follows from the specific manipulation performed but is not yet supported by evidence that standard, unmodified models exhibit comparable vulnerability under realistic prompting or retrieval conditions.

minor comments (2)

[Abstract] The abstract states that the paper is 'under review in AI and Ethics'; this sentence should be removed or rephrased for a submission to this journal.
[Results] Ensure that all statistical tests, sample sizes, and confidence intervals are reported with exact values and degrees of freedom rather than qualitative statements.
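
As a rough illustration of the reporting the referee asks for, the sketch below assumes a hypothetical detection task in which each participant either flags or fails to flag a misleading answer. It computes a Wilson score interval for each group's detection rate and a pooled two-proportion z-test between groups. The function names and every count are placeholders invented for this example; the paper reports no such data, and other tests (for example an exact test for small samples) may be more appropriate.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # HYPOTHETICAL placeholder counts, not results from the paper:
    # how many raters in each group flagged the altered-model answer as misleading.
    experts_flagged, experts_total = 18, 20
    novices_flagged, novices_total = 7, 20

    for label, s, n in [("experts", experts_flagged, experts_total),
                        ("non-experts", novices_flagged, novices_total)]:
        lo, hi = wilson_interval(s, n)
        print(f"{label}: {s}/{n} flagged, 95% Wilson CI [{lo:.3f}, {hi:.3f}]")

    z, p = two_proportion_z(experts_flagged, experts_total,
                            novices_flagged, novices_total)
    print(f"two-proportion z = {z:.3f}, two-sided p = {p:.4g}")
```

Reporting the raw counts alongside the interval and the test statistic, as the loop above does, is what lets a reader check the claim rather than take a qualitative statement on trust.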

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which help clarify the scope and limitations of our experimental study. We address each major point below, indicating revisions where appropriate to improve transparency and precision without overstating the original findings.

Referee: [Abstract / Methods] The description of the model-modification procedure is insufficiently detailed to assess whether the observed behavior generalizes to publicly available LLMs. No information is given on whether the prioritization was achieved via fine-tuning, retrieval-augmented generation, heavy system-prompt weighting, or another internal mechanism; without this, it is unclear whether the setup simulates ordinary user exposure to fringe material or instead relies on non-replicable internal re-ranking.

Authors: We agree that the Methods section requires greater specificity to allow assessment of replicability and generalizability. The original manuscript described the use of custom LLMs modified to prioritize fringe papers but did not fully elaborate the technical implementation. In the revision, we will expand this to explicitly state that prioritization was implemented via fine-tuning on a dataset derived from the selected fringe papers, augmented by system-prompt weighting; we will report the fine-tuning approach, dataset size, and hyperparameters used. This will clarify that the procedure is replicable in principle while acknowledging it does not directly replicate ordinary user interactions with unmodified public models.
revision: yes

Referee: [Results] The claim that altered-model outputs were 'difficult for non-experts to detect as misleading' is load-bearing for the central argument yet lacks supporting data. The manuscript does not report the design of any detection task, the number or demographics of non-expert participants, the exact instructions given, or any statistical comparison against expert or control conditions.

Authors: The referee is correct that the manuscript provides no formal experimental design, participant numbers, or statistical analysis for a non-expert detection task; the statement was based on qualitative assessment of output fluency by the authors rather than collected data. We will revise the Results and Discussion to qualify this claim explicitly as an observation of linguistic plausibility rather than an empirically validated finding from participant testing. No new data collection is possible at this stage, so the revision will remove the stronger phrasing and note it as a limitation requiring future work.
revision: partial

Referee: [Discussion] The conclusion that LLMs 'cannot replace expert judgment' follows from the specific manipulation performed but is not yet supported by evidence that standard, unmodified models exhibit comparable vulnerability under realistic prompting or retrieval conditions.

Authors: We accept that the conclusion should be scoped more narrowly to the manipulated models studied. The core finding is that LLMs can be made to generate consensus-contradicting yet fluent responses when trained or prompted to prioritize fringe sources, which demonstrates a vulnerability that could affect real-world use even if standard models require more effort to elicit similar outputs. In revision we will rephrase the Discussion to state that expert judgment remains necessary because models are susceptible to such influence, while adding explicit discussion of the limitation that our results do not directly test unmodified public LLMs under typical user prompting or RAG conditions.
revision: yes

standing simulated objections not resolved

The original study did not include a formal non-expert detection experiment with reported participant numbers, demographics, instructions, or statistics; therefore we cannot supply the missing empirical details and must instead qualify the claim.

Circularity Check

0 steps flagged

No circularity: empirical experimental comparison with no derivations or self-referential reductions

The paper reports an empirical study in which custom LLMs are modified to prioritize selected fringe papers on specific topics, after which their fluent but consensus-contradicting responses are compared against domain experts and unmodified LLMs. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to the study's own inputs appear in the described design or abstract. The results rest on direct observation of model outputs rather than any load-bearing derivation that could collapse into its own assumptions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical experimental study with no free parameters, mathematical axioms, or newly postulated entities required for the central claim.

pith-pipeline@v0.9.0 · 5404 in / 1106 out tokens · 86588 ms · 2026-05-07T14:52:51.844224+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1] Collins, H, Bartlett, A, and Reyes-Galindo, L., 2017 Demarcating Fringe Science for Policy, Perspectives on Science 25, (4):411-438

[2] Collins, H, Evans, R., Innes, M, Kennedy, E., B., Mason-Wilkes, W. and McLevy, J.,

[3] The Face-to-face Principle: Science, Trust, Democracy and the Internet, Cardiff University Press Open Access https://doi.org/10.18573/book7

[4] Giles, J 2006, Sociologist fools physics judges, Nature 442, 6 July 2006, p8

[5] https://www.theguardian.com/technology/2025/nov/03/grokipedia-academics-assess-elon-musk-ai-powered-encyclopedia

[6] Kate Niederhoffer, K. Kellerman, H, Lee, A., Liebscher, A., Rapuano, K., and Hancock, J. 2025 AI-Generated “Workslop” Is Destroying Productivity, Harvard Business Review, September 25

i Collins is Honorary Professor at the Institute of Education in University College London; Grote and Sutton are Professors in the Department of Physics and Astronomy at Car...
