Are LLMs More Skeptical of Entertainment News?
Pith reviewed 2026-05-10 16:24 UTC · model grok-4.3
The pith
Some large language models misclassify legitimate entertainment news as fake at higher rates than hard news.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-shot LLMs display a model-specific genre asymmetry in news credibility assessment, where two models exhibit false-positive-rate gaps of 10.1 and 8.8 percentage points between entertainment and hard news (both p < .001), but the other two show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for one model by about 50% without detectable recall loss, but offers little improvement for the other. Exploratory qualitative coding identifies two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre.
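To make the mitigation claim concrete, here is a minimal sketch of how a relative false-positive reduction and a recall change could be computed from two confusion matrices. It assumes "fake" is the positive class; the counts and the names `rates` and `mitigation_effect` are hypothetical and not taken from the paper.

```python
def rates(tp, fp, tn, fn):
    """Return (false-positive rate, recall), treating 'fake' as the positive class."""
    return fp / (fp + tn), tp / (tp + fn)


def mitigation_effect(baseline, mitigated):
    """Return (relative FPR reduction, absolute change in recall)."""
    fpr_0, recall_0 = rates(**baseline)
    fpr_1, recall_1 = rates(**mitigated)
    if fpr_0 == 0:
        raise ValueError("baseline FPR is zero; relative reduction is undefined")
    return (fpr_0 - fpr_1) / fpr_0, recall_1 - recall_0


# Hypothetical confusion counts, NOT results from the paper.
baseline = dict(tp=180, fp=40, tn=160, fn=20)   # default prompt
mitigated = dict(tp=178, fp=20, tn=180, fn=22)  # genre-aware prompt
reduction, recall_delta = mitigation_effect(baseline, mitigated)
print(f"relative FPR reduction = {reduction:.0%}, recall change = {recall_delta:+.3f}")
```

Note that a 50% figure of this kind is a relative reduction in the false-positive rate, which is a different quantity from a gap expressed in percentage points.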
What carries the argument
The within-dataset comparison of false-positive rates on legitimate articles from the GossipCop portion of FakeNewsNet, which isolates genre effects in zero-shot LLM credibility judgments.
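As an illustration of this kind of comparison, the sketch below computes a false-positive-rate gap in percentage points together with a pooled two-proportion z-test. The text here does not say which test produced the reported p-values, so the z-test is an assumption; the counts and the function names `fpr` and `two_proportion_z_test` are hypothetical.

```python
import math


def fpr(false_positives, legitimate_total):
    """False-positive rate: share of legitimate articles flagged as fake."""
    if legitimate_total <= 0:
        raise ValueError("legitimate_total must be positive")
    return false_positives / legitimate_total


def two_proportion_z_test(fp_a, n_a, fp_b, n_b):
    """Return (gap in percentage points, z, two-sided p) for FPR_a - FPR_b."""
    p_a, p_b = fpr(fp_a, n_a), fpr(fp_b, n_b)
    pooled = (fp_a + fp_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return (p_a - p_b) * 100, z, p_value


# Hypothetical counts, NOT taken from the paper:
# group a = legitimate entertainment items, group b = legitimate hard-news items.
gap_pp, z, p = two_proportion_z_test(fp_a=150, n_a=1000, fp_b=50, n_b=1000)
print(f"gap = {gap_pp:.1f} pp, z = {z:.2f}, p = {p:.2g}")
```

The test only quantifies the size of a gap between two groups; it cannot say whether the gap comes from genre or from other differences between the two samples.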
Load-bearing premise
The within-dataset design on GossipCop sufficiently isolates genre effects from confounding differences in topic, source, or unverifiability of private-life claims between entertainment and hard news.
What would settle it
A replication on a new dataset with entertainment and hard news articles matched on topic, source, and claim verifiability, showing no significant false-positive gaps for the affected models, would falsify the asymmetry claim.
Original abstract
Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both p < .001), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines whether zero-shot LLMs apply even-handed credibility standards across journalistic genres by testing for higher false-positive rates on legitimate entertainment news than on legitimate hard news. It employs a within-dataset design on the GossipCop subset of FakeNewsNet, reports model-specific FPR gaps of 10.1 pp (DeepSeek-V3.2) and 8.8 pp (GPT-5.2) with p < .001, finds no comparable gaps for Claude Opus 4.6 and Gemini 3 Flash, and supplements these with style-swap experiments, prompt-based mitigation trials, and qualitative coding of error patterns in false positives.
Significance. If the genre asymmetry were demonstrated with a design that isolates journalistic genre from topic, source, and claim-type confounds, the result would be significant for AI-assisted misinformation detection: it would show that aggregate accuracy metrics can mask structured biases and that evaluation protocols should incorporate genre-stratified FPR analysis. The model-specific mitigation findings and qualitative error patterns (private-life unverifiability and genre legitimacy discounting) would further inform prompt engineering and bias auditing practices.
major comments (2)
- [Abstract] Abstract and Methods: The central claim requires a within-dataset contrast between legitimate entertainment news and legitimate hard news to compute the reported FPR gaps. GossipCop contains only celebrity-gossip and entertainment articles (real and fake); it supplies no hard-news items. Consequently the 10.1 pp and 8.8 pp differences cannot be obtained inside the stated design, and any cross-dataset supplementation would introduce uncontrolled differences in topic, labeling process, and claim verifiability that the within-dataset framing is intended to eliminate.
- [Methods] Methods (prompting and sample construction): The manuscript does not provide the exact prompt templates, the precise definition of “legitimate” items used for FPR calculation, or explicit controls for topic/claim-type differences between the entertainment samples and any hard-news samples. These omissions prevent evaluation of whether the reported p-values reflect genre effects or residual confounds.
minor comments (2)
- [Results] Table and figure captions should explicitly state the number of legitimate items per genre and per model so that the FPR denominators are transparent.
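One way to make the denominators transparent is to report counts next to each rate. The sketch below tabulates false positives and legitimate-item totals per model and genre; the record layout, the toy data, and the name `stratified_fpr` are assumptions made for illustration.

```python
from collections import defaultdict


def stratified_fpr(records):
    """Tabulate false positives per (model, genre) over legitimate items only.

    records: iterable of (model, genre, predicted_fake) tuples.
    Returns {(model, genre): (false_positives, n_legitimate, fpr)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [false positives, legitimate total]
    for model, genre, predicted_fake in records:
        cell = counts[(model, genre)]
        cell[0] += int(bool(predicted_fake))
        cell[1] += 1
    return {key: (fp, n, fp / n) for key, (fp, n) in counts.items()}


# Hypothetical toy records for legitimate articles, NOT data from the paper.
records = [
    ("model_a", "entertainment", True),
    ("model_a", "entertainment", False),
    ("model_a", "entertainment", False),
    ("model_a", "entertainment", False),
    ("model_a", "hard_news", False),
    ("model_a", "hard_news", False),
]
table = stratified_fpr(records)
for (model, genre), (fp, n, rate) in sorted(table.items()):
    print(f"{model:8s} {genre:14s} FP={fp} n={n} FPR={rate:.1%}")
```

Printing `FP` and `n` beside each rate lets a reader check every percentage against its denominator.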
- [Qualitative Analysis] The qualitative coding section would benefit from an inter-coder agreement statistic and a clearer description of the sampling frame for the false-positive instances examined.
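Cohen's kappa is one common inter-coder agreement statistic for two coders assigning nominal categories. The sketch below is a plain implementation; the category labels and coder data are hypothetical, and nothing here indicates which statistic the authors would choose.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for six sampled false positives, NOT data from the paper.
coder_1 = ["unverifiable", "unverifiable", "genre_discount",
           "genre_discount", "other", "other"]
coder_2 = ["unverifiable", "unverifiable", "genre_discount",
           "other", "other", "other"]
kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one error category dominates the sample.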
Simulated Author's Rebuttal
Thank you for the detailed referee report and the opportunity to address these important points. We respond to each major comment below and will make corresponding revisions to improve clarity, reproducibility, and acknowledgment of limitations.
Point-by-point responses
-
Referee: [Abstract] Abstract and Methods: The central claim requires a within-dataset contrast between legitimate entertainment news and legitimate hard news to compute the reported FPR gaps. GossipCop contains only celebrity-gossip and entertainment articles (real and fake); it supplies no hard-news items. Consequently the 10.1 pp and 8.8 pp differences cannot be obtained inside the stated design, and any cross-dataset supplementation would introduce uncontrolled differences in topic, labeling process, and claim verifiability that the within-dataset framing is intended to eliminate.
Authors: We thank the referee for identifying this inconsistency. The abstract's characterization of the design as a 'within-dataset design on GossipCop' is imprecise and incorrect. The entertainment samples are drawn from the GossipCop subset of FakeNewsNet, while the hard-news samples come from the Politifact subset. This cross-dataset comparison does introduce potential confounds in topic, source, labeling processes, and claim verifiability, as the referee notes. We will revise the abstract, methods, and discussion sections to accurately describe the data sources, remove the 'within-dataset' phrasing, and explicitly discuss these limitations and their implications for interpreting the FPR gaps. We believe the model-specific patterns remain noteworthy but will present the results with appropriate caveats rather than claiming isolation of genre effects.
revision: yes
-
Referee: [Methods] Methods (prompting and sample construction): The manuscript does not provide the exact prompt templates, the precise definition of “legitimate” items used for FPR calculation, or explicit controls for topic/claim-type differences between the entertainment samples and any hard-news samples. These omissions prevent evaluation of whether the reported p-values reflect genre effects or residual confounds.
Authors: We agree that these omissions limit reproducibility and make it harder to rule out confounds. In the revised manuscript we will add the exact prompt templates used for each model, a precise definition of 'legitimate' items (those labeled real in the source datasets), and an expanded methods subsection that discusses topic and claim-type differences between the GossipCop and Politifact samples. We will also elaborate on the style-swap experiments as a partial attempt to address stylistic and topical variation. These additions should allow readers to better assess the strength of the genre-asymmetry findings.
revision: yes
Circularity Check
No circularity: empirical reporting on external dataset
full rationale
The paper reports observed false-positive rates and statistical tests on the public GossipCop subset of FakeNewsNet. No derivation, equation, or central claim reduces to its own inputs by construction. There are no fitted parameters presented as predictions, no self-citation load-bearing premises, and no ansatz or uniqueness theorems invoked. The analysis is self-contained against the external benchmark dataset and standard hypothesis testing; any limitations in dataset coverage (e.g., genre composition) affect validity but do not create circularity in the reported chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: GossipCop dataset labels accurately distinguish real from fake news
- domain assumption: Zero-shot prompting produces model outputs representative of general behavior on news credibility tasks
Genre-classification prompt (fragment from the paper's appendix, as extracted)
HARD NEWS - Factual, neutral tone; who/what/when/where/why structure; minimal evaluative language
ENTERTAINMENT GOSSIP - Exaggerated, emotionally vivid, hyperbolic; celebrity/personal focus; designed to entertain
OPINION EDITORIAL - Explicit subjective stance; passionate critique; persuasive intent
PROMOTIONAL - Reports on marketing/PR events; promotional language; brand partnerships
Article: {text}
Respond with ONLY a JSON object: {"genre": "...", "confidence": 0.0-1.0}
A.3 Coding Instructions
The four genre categories were defined as follows, along with illustrative examples:
• ENTERTAINMENT GOSSIP: Celebrity news, relationship updates, lifestyle cov...
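A reply in the JSON shape requested by this prompt could be validated along the following lines. This is a sketch under stated assumptions: the function name `parse_genre_response` is hypothetical, the category spellings are copied from the fragment above, and treating underscores as spaces is a guess about how labels might appear in practice.

```python
import json

# Category names as they appear in the extracted fragment.
ALLOWED_GENRES = {
    "HARD NEWS",
    "ENTERTAINMENT GOSSIP",
    "OPINION EDITORIAL",
    "PROMOTIONAL",
}


def parse_genre_response(raw):
    """Parse a reply shaped like {"genre": "...", "confidence": 0.0-1.0}.

    Returns (genre, confidence) or raises ValueError on malformed replies.
    """
    try:
        payload = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("reply must be a JSON object")
    # Assumption: accept lower case and underscores as spelling variants.
    genre = str(payload.get("genre", "")).strip().upper().replace("_", " ")
    if genre not in ALLOWED_GENRES:
        raise ValueError(f"unknown genre: {genre!r}")
    confidence = payload.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return genre, float(confidence)


print(parse_genre_response('{"genre": "hard_news", "confidence": 0.9}'))
```

Rejecting malformed replies explicitly, instead of silently coercing them, keeps the count of excluded items visible when rates are later computed.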