
arXiv: 2605.01727 · v1 · submitted 2026-05-03 · 💻 cs.AI · cs.CY

Are LLMs More Skeptical of Entertainment News?

Pith reviewed 2026-05-10 16:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CY
keywords LLMs · news credibility assessment · genre asymmetry · false positive rates · entertainment news · zero-shot evaluation · GossipCop · FakeNewsNet

The pith

Some large language models misclassify legitimate entertainment news as fake at higher rates than hard news.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether zero-shot LLMs apply the same standards when judging the credibility of entertainment news versus hard news. It measures false-positive rates, meaning the share of legitimate articles labeled fake, for four models on real and fake articles from GossipCop. Two models label real entertainment stories as fake significantly more often than real hard news, by about 9 to 10 percentage points, while the other two show no comparable gap. Style-swap experiments and prompt adjustments indicate that the bias is not only a matter of writing style and that targeted instructions can reduce it for some models. The upshot is that overall accuracy numbers can hide systematic differences in how models treat different types of journalism.

Core claim

Zero-shot LLMs display a model-specific genre asymmetry in news credibility assessment, where two models exhibit false-positive-rate gaps of 10.1 and 8.8 percentage points between entertainment and hard news (both p < .001), but the other two show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for one model by about 50% without detectable recall loss, but offers little improvement for the other. Exploratory qualitative coding identifies two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre.

What carries the argument

The within-dataset comparison of false-positive rates on legitimate articles from the GossipCop portion of FakeNewsNet, which isolates genre effects in zero-shot LLM credibility judgments.
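The false-positive-rate comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's analysis code: the counts are hypothetical, and the pooled two-proportion z-test is an assumed choice, since the summary does not say which test produced the reported p-values.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Compare two false-positive rates with a pooled two-proportion z-test.

    x1, x2: legitimate articles labeled "fake" in each genre.
    n1, n2: total legitimate articles in each genre (the FPR denominators).
    Returns (rate gap, z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value


# Hypothetical counts, chosen only to illustrate a gap of 10 percentage points.
ENT_FLAGGED, ENT_TOTAL = 150, 1000    # legitimate entertainment items labeled fake
HARD_FLAGGED, HARD_TOTAL = 50, 1000   # legitimate hard-news items labeled fake

gap, z, p_value = two_proportion_z_test(
    ENT_FLAGGED, ENT_TOTAL, HARD_FLAGGED, HARD_TOTAL
)
print(f"FPR gap = {gap * 100:.1f} pp, z = {z:.2f}, p = {p_value:.2e}")
```

Whether a gap of this size reaches significance depends heavily on the number of legitimate items per genre, which is why the denominators matter when reading the reported gaps.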

Load-bearing premise

The within-dataset design on GossipCop sufficiently isolates genre effects from confounding differences in topic, source, or unverifiability of private-life claims between entertainment and hard news.

What would settle it

A replication on a new dataset with entertainment and hard news articles matched on topic, source, and claim verifiability, showing no significant false-positive gaps for the affected models, would falsify the asymmetry claim.

Figures

Figures reproduced from arXiv: 2605.01727 by Huiqian Lai.

Figure 1: False-positive rates for real entertainment-gossip ...
Figure 2: Fake rates before and after rewriting legitimate en...
Original abstract

Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both p < .001), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines whether zero-shot LLMs apply even-handed credibility standards across journalistic genres by testing for higher false-positive rates on legitimate entertainment news than on legitimate hard news. It employs a within-dataset design on the GossipCop subset of FakeNewsNet, reports model-specific FPR gaps of 10.1 pp (DeepSeek-V3.2) and 8.8 pp (GPT-5.2) with p < .001, finds no comparable gaps for Claude Opus 4.6 and Gemini 3 Flash, and supplements these with style-swap experiments, prompt-based mitigation trials, and qualitative coding of error patterns in false positives.

Significance. If the genre asymmetry were demonstrated with a design that isolates journalistic genre from topic, source, and claim-type confounds, the result would be significant for AI-assisted misinformation detection: it would show that aggregate accuracy metrics can mask structured biases and that evaluation protocols should incorporate genre-stratified FPR analysis. The model-specific mitigation findings and qualitative error patterns (private-life unverifiability and genre legitimacy discounting) would further inform prompt engineering and bias auditing practices.
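To make the point about aggregate metrics concrete, the sketch below computes overall accuracy next to a genre-stratified false-positive rate. The records are synthetic and the field layout `(genre, true_label, predicted_label)` is an assumption made for illustration; nothing here reproduces the paper's data.

```python
from collections import defaultdict

# Synthetic records: (genre, true_label, predicted_label). Hypothetical data.
records = (
    [("hard", "real", "real")] * 95
    + [("hard", "real", "fake")] * 5
    + [("entertainment", "real", "real")] * 85
    + [("entertainment", "real", "fake")] * 15
    + [("hard", "fake", "fake")] * 50
    + [("entertainment", "fake", "fake")] * 50
)


def accuracy(rows):
    """Share of items whose predicted label matches the true label."""
    return sum(truth == pred for _, truth, pred in rows) / len(rows)


def fpr_by_genre(rows):
    """False-positive rate per genre: legitimate items labeled fake / legitimate items."""
    flagged = defaultdict(int)
    legitimate = defaultdict(int)
    for genre, truth, pred in rows:
        if truth == "real":
            legitimate[genre] += 1
            if pred == "fake":
                flagged[genre] += 1
    return {genre: flagged[genre] / legitimate[genre] for genre in legitimate}


overall = accuracy(records)
rates = fpr_by_genre(records)
print(f"overall accuracy: {overall:.3f}")
for genre, rate in sorted(rates.items()):
    print(f"FPR[{genre}]: {rate:.3f}")
```

In this toy data the overall accuracy is above 93%, yet legitimate entertainment items are flagged three times as often as legitimate hard-news items, which is the kind of pattern a single accuracy figure would not reveal.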

major comments (2)
  1. [Abstract] Abstract and Methods: The central claim requires a within-dataset contrast between legitimate entertainment news and legitimate hard news to compute the reported FPR gaps. GossipCop contains only celebrity-gossip and entertainment articles (real and fake); it supplies no hard-news items. Consequently the 10.1 pp and 8.8 pp differences cannot be obtained inside the stated design, and any cross-dataset supplementation would introduce uncontrolled differences in topic, labeling process, and claim verifiability that the within-dataset framing is intended to eliminate.
  2. [Methods] Methods (prompting and sample construction): The manuscript does not provide the exact prompt templates, the precise definition of “legitimate” items used for FPR calculation, or explicit controls for topic/claim-type differences between the entertainment samples and any hard-news samples. These omissions prevent evaluation of whether the reported p-values reflect genre effects or residual confounds.
minor comments (2)
  1. [Results] Table and figure captions should explicitly state the number of legitimate items per genre and per model so that the FPR denominators are transparent.
  2. [Qualitative Analysis] The qualitative coding section would benefit from an inter-coder agreement statistic and a clearer description of the sampling frame for the false-positive instances examined.
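As an illustration of the inter-coder agreement statistic requested in minor comment 2, the sketch below computes Cohen's kappa for two coders. Cohen's kappa is one common choice and is assumed here; the report does not name a specific statistic. The category names and the ten coded items are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders assigning one category to each item."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same category by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal category proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both coders use a single category")
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten sampled false positives.
coder_a = ["unverifiable", "unverifiable", "genre_discount", "genre_discount", "other",
           "unverifiable", "genre_discount", "other", "unverifiable", "genre_discount"]
coder_b = ["unverifiable", "unverifiable", "genre_discount", "other", "other",
           "unverifiable", "genre_discount", "other", "genre_discount", "genre_discount"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a simple percentage when some codes are much more frequent than others.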

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed referee report and the opportunity to address these important points. We respond to each major comment below and will make corresponding revisions to improve clarity, reproducibility, and acknowledgment of limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Methods: The central claim requires a within-dataset contrast between legitimate entertainment news and legitimate hard news to compute the reported FPR gaps. GossipCop contains only celebrity-gossip and entertainment articles (real and fake); it supplies no hard-news items. Consequently the 10.1 pp and 8.8 pp differences cannot be obtained inside the stated design, and any cross-dataset supplementation would introduce uncontrolled differences in topic, labeling process, and claim verifiability that the within-dataset framing is intended to eliminate.

    Authors: We thank the referee for identifying this inconsistency. The abstract's characterization of the design as a 'within-dataset design on GossipCop' is imprecise and incorrect. The entertainment samples are drawn from the GossipCop subset of FakeNewsNet, while the hard-news samples come from the Politifact subset. This cross-dataset comparison does introduce potential confounds in topic, source, labeling processes, and claim verifiability, as the referee notes. We will revise the abstract, methods, and discussion sections to accurately describe the data sources, remove the 'within-dataset' phrasing, and explicitly discuss these limitations and their implications for interpreting the FPR gaps. We believe the model-specific patterns remain noteworthy but will present the results with appropriate caveats rather than claiming isolation of genre effects. revision: yes

  2. Referee: [Methods] Methods (prompting and sample construction): The manuscript does not provide the exact prompt templates, the precise definition of “legitimate” items used for FPR calculation, or explicit controls for topic/claim-type differences between the entertainment samples and any hard-news samples. These omissions prevent evaluation of whether the reported p-values reflect genre effects or residual confounds.

    Authors: We agree that these omissions limit reproducibility and make it harder to rule out confounds. In the revised manuscript we will add the exact prompt templates used for each model, a precise definition of 'legitimate' items (those labeled real in the source datasets), and an expanded methods subsection that discusses topic and claim-type differences between the GossipCop and Politifact samples. We will also elaborate on the style-swap experiments as a partial attempt to address stylistic and topical variation. These additions should allow readers to better assess the strength of the genre-asymmetry findings. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical reporting on external dataset

Full rationale

The paper reports observed false-positive rates and statistical tests on the public GossipCop subset of FakeNewsNet. No derivation, equation, or central claim reduces to its own inputs by construction. There are no fitted parameters presented as predictions, no load-bearing self-citations, and no ansatz or uniqueness theorems invoked. The analysis is self-contained against the external benchmark dataset and standard hypothesis testing; any limitations in dataset coverage (e.g., genre composition) affect validity but do not create circularity in the chain of reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the accuracy of dataset labels as ground truth and the assumption that the experimental setup isolates genre recognition from other variables.

axioms (2)
  • domain assumption: GossipCop dataset labels accurately distinguish real from fake news
    Used as ground truth for calculating false-positive rates across genres.
  • domain assumption: Zero-shot prompting produces model outputs representative of general behavior on news credibility tasks
    Basis for the main experiments and style-swap tests.

pith-pipeline@v0.9.0 · 5566 in / 1397 out tokens · 78543 ms · 2026-05-10T16:24:57.696423+00:00 · methodology

