MegaFake: A Theory-Driven Dataset of Fake News Generated by Large Language Models

Ka Chung Ng; Lionel Z. Wang; Wenqi Fan; Yiming Ma

arxiv: 2408.11871 · v4 · submitted 2024-08-19 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-23 21:29 UTC · model grok-4.3

keywords fake news · large language models · deception · prompt engineering · dataset · misinformation detection · social psychology

The pith

A framework integrating social psychology theories guides LLMs to automatically generate a large dataset of fake news without manual labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a theoretical account of how large language models produce deceptive content by combining established social psychology ideas into one framework. It then applies that account to design a prompt pipeline that turns existing news sources into machine-generated fake news at scale. The resulting dataset supports experiments on both the mechanisms of machine deception and the training of detectors that work against LLM output. A reader would care because scalable AI-generated misinformation is already appearing in public information channels, while current detection tools were built for human-written examples.

Core claim

By embedding multiple social psychology theories into the LLM-Fake Theory framework, the authors produce an automated prompt engineering pipeline that converts items from an existing news collection into a large set of machine-generated fake news examples. Experiments on this set advance both explanations of human-machine deception and practical detection methods for the LLM era.

What carries the argument

The LLM-Fake Theory framework, which integrates social psychology theories to model the motivations and mechanisms of machine-generated deception and then directs the prompt pipeline.

If this is right

Detection systems can be trained directly on machine-generated examples rather than relying on scarce human-labeled data.
The same prompt pipeline can be reused to expand the dataset or adapt it to new source collections without additional manual work.
Theoretical accounts of deception can be tested and refined by measuring how well models trained under the framework transfer to real-world LLM outputs.
Governance approaches can incorporate the identified deception mechanisms when designing platform policies for AI content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the framework holds, the same theory-driven prompts could be adapted to generate other forms of deceptive text such as synthetic reviews or political statements.
Detection research could next test whether models fine-tuned on this dataset retain performance when the underlying LLM changes to a newer version.
The pipeline might reduce the cost barrier for creating balanced training sets that include both human and machine fake news.

Load-bearing premise

That combining social psychology theories into the framework correctly describes how LLMs deceive and that the prompt pipeline therefore produces realistic enough examples to support useful detection research.

What would settle it

A controlled test in which detectors trained on the new dataset show no improvement over detectors trained on human-written fake news when both are evaluated on fresh LLM-generated examples from unrelated sources.

Figures

Figures reproduced from arXiv: 2408.11871 by Ka Chung Ng, Lionel Z. Wang, Wenqi Fan, Yiming Ma.

**Figure 2.** Results for Different Topic Numbers (Information Blending)

**Figure 3.** Document Matching Process (Information Blending). Diagram labels: Topic Documents; Split; Odd Legitimate Documents; Even Legitimate Documents; Sort by Probability; Sorted Odd Legitimate; Sorted Even Legitimate; Pair; Document Pairs.

**Figure 5.** Results for Different Topic Numbers (News Summarization)

**Figure 6.** Results for Different Temperatures (Writing Enhancement)

**Figure 7.** Results for Different Temperatures (News Summarization)

**Figure 8.** Word Clouds of the GossipCop and MegaFake Datasets

**Figure 9.** Word Clouds of Six Different News Types in MegaFake Dataset

**Figure 10.** Sentence Count and Word Count Density Histogram for GossipCop and MegaFake Datasets

**Figure 11.** Sentence Count and Word Count Density Histogram for Six Different News Types in MegaFake Dataset

**Figure 12.** Sentence Length and Word Length Density Histograms for GossipCop and MegaFake Datasets

**Figure 13.** Sentence Length and Word Length Density Histograms for Six Different News Types in MegaFake

**Figure 14.** Visualization of Result for Experiments on Classifying Legitimate News Types

**Figure 15.** Visualization of Result for Experiments on Classifying Fake News Types

**Figure 16.** Confusion Matrix

**Figure 17.** Visualization of Result for Experiments on Classifying Six Different News Types
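
The document matching process named in Figure 3 (split, sort by probability, pair) might look roughly like the sketch below. Only the step names come from the figure; the split rule (odd versus even position), the descending sort order, the rank-by-rank pairing, and the `(id, probability)` tuple layout are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical record layout: (document id, topic probability).
Doc = Tuple[str, float]


def pair_topic_documents(docs: List[Doc]) -> List[Tuple[str, str]]:
    """Pair documents within one topic, following the step names in Figure 3."""
    # Split: documents at odd and even positions (1-based) form two groups.
    # Assumption: the figure's "odd"/"even" refers to position in the topic list.
    odd = docs[0::2]
    even = docs[1::2]
    # Sort by Probability: highest topic probability first (assumed direction).
    odd_sorted = sorted(odd, key=lambda d: d[1], reverse=True)
    even_sorted = sorted(even, key=lambda d: d[1], reverse=True)
    # Pair: match documents of equal rank; zip drops an unmatched leftover.
    return [(a[0], b[0]) for a, b in zip(odd_sorted, even_sorted)]


# Made-up example data for one topic.
topic_docs = [("d1", 0.91), ("d2", 0.40), ("d3", 0.75), ("d4", 0.88), ("d5", 0.60)]
pairs = pair_topic_documents(topic_docs)
print(pairs)
```

With five documents the fifth has no partner and is dropped by `zip`; whether the original pipeline discards or reassigns such leftovers is not stated on this page.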

Original abstract

Fake news significantly influences decision-making processes by misleading individuals, organizations, and even governments. Large language models (LLMs), as part of generative AI, can amplify this problem by generating highly convincing fake news at scale, posing a significant threat to online information integrity. Therefore, understanding the motivations and mechanisms behind fake news generated by LLMs is crucial for effective detection and governance. In this study, we develop the LLM-Fake Theory, a theoretical framework that integrates various social psychology theories to explain machine-generated deception. Guided by this framework, we design an innovative prompt engineering pipeline that automates fake news generation using LLMs, eliminating manual annotation needs. Utilizing this pipeline, we create a theoretically informed Machine-generated Fake news dataset, MegaFake, derived from FakeNewsNet. Through extensive experiments with MegaFake, we advance both theoretical understanding of human-machine deception mechanisms and practical approaches to fake news detection in the LLM era.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the LLM-Fake Theory, a framework integrating multiple social psychology theories to explain motivations and mechanisms of deception in LLM-generated fake news. It describes an automated prompt-engineering pipeline that uses this theory to generate the MegaFake dataset from FakeNewsNet, eliminating manual annotation, and reports extensive experiments on the resulting dataset to advance both theoretical insight into human-machine deception and practical detection methods in the LLM era.

Significance. If the theory-to-prompt translation is shown to produce outputs that reliably instantiate the targeted constructs and if downstream experiments demonstrate measurable gains over non-theory baselines, the work would supply a scalable, theory-grounded resource that could support more interpretable detection research and reduce reliance on manually curated corpora.

major comments (1)

[Abstract and pipeline description] The claim that the LLM-Fake Theory produces a prompt pipeline whose outputs embody the intended deception mechanisms (and thereby enable theoretical advancement) is load-bearing, yet no validation is reported, whether human ratings or automated metrics, showing that generated articles exhibit the integrated social-psychology constructs at rates above those obtained from generic LLM prompting. Without such evidence the dataset's claimed utility for studying human-machine deception collapses.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment identifies a gap in empirical validation of the theory-to-prompt pipeline, which we address directly below.

Referee: [Abstract and pipeline description] The claim that the LLM-Fake Theory produces a prompt pipeline whose outputs embody the intended deception mechanisms (and thereby enable theoretical advancement) is load-bearing, yet no validation is reported, whether human ratings or automated metrics, showing that generated articles exhibit the integrated social-psychology constructs at rates above those obtained from generic LLM prompting. Without such evidence the dataset's claimed utility for studying human-machine deception collapses.

Authors: We agree that the manuscript does not report direct validation (human ratings or automated metrics) comparing the presence of the targeted social-psychology constructs in theory-guided outputs versus generic LLM prompting. The current experiments demonstrate downstream utility for detection tasks but do not isolate whether the pipeline reliably instantiates the LLM-Fake Theory constructs. In revision we will add a human evaluation study in which annotators rate randomly sampled articles from both conditions on the presence and strength of the integrated deception mechanisms, together with inter-annotator agreement statistics. If feasible we will also report any automated proxies (e.g., lexical or semantic indicators derived from the theory). This addition will directly test the load-bearing claim.

revision: yes
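
The rebuttal promises inter-annotator agreement statistics without naming one. As one possible choice, the sketch below computes Cohen's kappa for two annotators; the statistic, the label names, and the example ratings are assumptions for illustration and do not come from the paper or the rebuttal.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up ratings: does an article show the targeted deception mechanism?
annotator_1 = ["present", "present", "absent", "absent"]
annotator_2 = ["present", "absent", "absent", "absent"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa handles exactly two raters with nominal labels; a study with more annotators or graded "strength" ratings would likely need a different statistic.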

Circularity Check

0 steps flagged

No circularity; theory-to-dataset pipeline is self-contained without reduction to inputs

The paper constructs LLM-Fake Theory by integrating existing social psychology theories, applies it to design a prompt pipeline, generates MegaFake from FakeNewsNet, and runs experiments. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims about advancing understanding do not reduce by construction to the paper's own inputs or prior self-referential results, meeting the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the newly assembled LLM-Fake Theory and the effectiveness of the prompt pipeline in producing usable data without manual labels; no numeric free parameters are described.

axioms (1)

domain assumption: Social psychology theories about deception can be integrated to explain machine-generated fake news.
Invoked to justify construction of the LLM-Fake Theory framework.

invented entities (1)

LLM-Fake Theory (no independent evidence)
purpose: Theoretical framework integrating social psychology to explain LLM-generated deception
Newly proposed construct that guides the prompt engineering pipeline

pith-pipeline@v0.9.0 · 5711 in / 1401 out tokens · 31226 ms · 2026-05-23T21:29:47.932362+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM Harms: A Taxonomy and Discussion
cs.CY · 2025-12 · unverdicted · novelty 3.0

This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Advances in neural information processing systems, 33:4271–4282

Funnel-transformer: Filtering out sequential redundancy for efficient language processing. Advances in neural information processing systems, 33:4271–4282. Debarati Das, Karin De Langis, Anna Martin, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Hayati, Risako Owan, Bin Hu, Ritik Parkar, et al. 2024. Under the surface: Tracking the artifactuality ...

[2]

arXiv preprint arXiv:2006.03659

Declutr: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659. Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793. Kat...

[3]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. David MJ Lazer, Matthew A Baum, Yochai Benkler, Adam J Berinsky, Kelly M Greenhill, Filippo Menczer, Miriam J Metzger, Brendan Nyhan, Gordon Pennycook, David Rothschild, et al. 2018. The science of fake news. Science, 359(6380):1094–1096. K...

[4]

arXiv preprint arXiv:2310.15515

Fighting fire with fire: The dual role of llms in crafting and detecting elusive disinformation. arXiv preprint arXiv:2310.15515. Zheheng Luo, Qianqian Xie, and Sophia Ananiadou

[5]

arXiv preprint arXiv:2303.15621

Chatgpt as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621. Patricia Moravec, Randall Minas, and Alan R Dennis

[6]

Kelley School of Business research paper

Fake news on social media: People believe what they want to believe when it makes no sense at all. Kelley School of Business research paper. Martin Müller, Marcel Salathé, and Per E Kummervold

[7]

liar, liar pants on fire

Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. Frontiers in artificial intelligence, 6:1023281. Taichi Murayama. 2021. Dataset of fake news detection and fact verification: a survey. arXiv preprint arXiv:2111.03299. Qiong Nan, Qiang Sheng, Juan Cao, Beizhe Hu, Danding Wang, and Jintao Li. 2024. Let si...

[8]

Transactions of the Association for Computational Linguistics, 12:39–57

Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57. Xiaohui Zhang, Qianzhou Du, and Zhongju Zhang

[9]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

A theory-driven machine learning system for financial disinformation detection. Production and Operations Management, 31(8):3160–3179. Xichen Zhang and Ali A Ghorbani. 2020. An overview of online fake news: Characterization, detection, and discussion. Information Processing & Management, 57(2):102025. Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, T...

[10]

megafake-1_style_based_fake.json contains the following fields: origin_id, origin_label, origin_text, generated_text, generated_tone, generated_label

[11]

megafake-2_content_based_fake.json contains the following fields: origin_id, origin_label, origin_text, generated_text_glm4

[12]

megafake-3_integration_based_fake_tn200.json contains the following fields: doc_1_id, doc_1_label, doc_1_text, doc_2_id, doc_2_label, doc_2_text, generated_text.

[13]

megafake-4_story_based_fake.json contains the following fields: origin_id, origin_label, origin_text, generated_text

[14]

megafake-5_style_based_legitimate.json contains the following fields: origin_id, origin_label, origin_text, generated_label, generated_text_t015

[15]

containing sensitive content

megafake-7_integration_based_legitimate_tn300.json contains the following fields: topic_id, topic_words, doc_1_id, doc_1_label, doc_1_text, doc_2_id, doc_2_label, doc_2_text, generated_label, generated_text_t01.

The dataset comprises six different types:
• megafake-1_style_based_fake.json: 15,729 news items.
• megafake-2_content_based_fake.json: 11,941...

[16]

[fake news article]

Rewrite the following news article in an objective and professional tone without changing the content and meaning while keeping a similar length. [fake news article]

[17]

[fake news article] For a human-generated legitimate news article as an input:

Rewrite the following news article in a neutral tone without changing the content and meaning while keeping a similar length. [fake news article] For a human-generated legitimate news article as an input:

[18]

[legitimate news article]

Rewrite the following news article in an emotionally triggering tone without changing the content and meaning while keeping a similar length. [legitimate news article]

[19]

said," "time,

Rewrite the following news article in a sensational tone without changing the content and meaning while keeping a similar length. [legitimate news article] Content Manipulation (Satapara et al. 2024) Utilize a LLM to manipulate human-generated legitimate news content by modifying multiple attributes. Elaboration Likelihood Model (Petty and Briñol 20...
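
The four style-rewriting prompts quoted in entries [16] to [19] share one template and differ only in tone, and entry [10] lists the fields of the style-based fake file. The sketch below shows how those two pieces might be used together. The prompt wording and field names are taken from the entries above; the function names, the `"fake"`/`"legitimate"` keys, the sample record, and the idea that a record is a flat JSON object are assumptions for illustration only.

```python
import json
from typing import Dict, List, Set

# Template shared by the four quoted prompts; only the tone phrase varies.
PROMPT_TEMPLATE = (
    "Rewrite the following news article in {tone} tone without changing the "
    "content and meaning while keeping a similar length. {article}"
)

# Tone phrases quoted in entries [16]-[19], grouped by the input article's type.
# The dictionary keys are hypothetical names, not labels from the dataset.
TONES: Dict[str, List[str]] = {
    "fake": ["an objective and professional", "a neutral"],
    "legitimate": ["an emotionally triggering", "a sensational"],
}

# Field names listed for megafake-1_style_based_fake.json in entry [10].
STYLE_FAKE_FIELDS: Set[str] = {
    "origin_id", "origin_label", "origin_text",
    "generated_text", "generated_tone", "generated_label",
}


def build_style_prompts(article: str, source_type: str) -> List[str]:
    """Return one prompt per tone quoted for the given input type."""
    return [PROMPT_TEMPLATE.format(tone=tone, article=article)
            for tone in TONES[source_type]]


def missing_fields(record: dict) -> Set[str]:
    """Report which of the listed fields are absent from a record."""
    return STYLE_FAKE_FIELDS - set(record)


# Hypothetical record; every value here is a placeholder, not dataset content.
sample = json.loads(
    '{"origin_id": "x1", "origin_label": "legitimate", "origin_text": "Example text.",'
    ' "generated_text": "...", "generated_tone": "sensational", "generated_label": "fake"}'
)
prompts = build_style_prompts(sample["origin_text"], "legitimate")
print(prompts[1])
print(missing_fields(sample))
```

Checking `missing_fields` before processing is a cheap guard, since the per-file field lists above differ from one file to the next (for example `generated_text_glm4` versus `generated_text`).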
