
arxiv: 2605.23326 · v1 · pith:QHS6NSSFnew · submitted 2026-05-22 · 💻 cs.CL

ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

Pith reviewed 2026-05-25 04:53 UTC · model grok-4.3

classification 💻 cs.CL
keywords climate change, Facebook, dataset, social media, public discourse, topic modeling, sentiment analysis, engagement metrics

The pith

ClimateChat-300K releases nearly 300,000 Facebook posts on climate change as an open dataset for discourse analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces ClimateChat-300K, a dataset of 299,329 public Facebook posts about climate change collected from May 2020 to May 2024 through the CrowdTangle platform. It contains 41 metadata features including post content, engagement metrics, and page attributes from more than 26,000 global pages. The authors apply topic modeling and sentiment analysis to identify ten main themes grouped into five domains and observe that emotional tone, post format, and page identity affect engagement levels. The dataset is positioned to support research on polarization, misinformation, and how discussions change with events such as climate summits and the pandemic period. A reader would care because it supplies a concrete, large-scale collection that makes such studies reproducible across time, geography, and institutional settings.

Core claim

ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse by releasing a large-scale collection of Facebook posts with rich contextual information that enables analyses of public engagement.

What carries the argument

ClimateChat-300K, a dataset of 299,329 Facebook posts equipped with 41 metadata features covering content, engagement, language, timestamps, page categories, and interaction counts.

If this is right

  • Emotional tone, post format, and page identity strongly influence audience engagement with climate-related posts.
  • Visually rich and emotionally charged content receives the highest levels of interaction.
  • Online discussions evolve in response to major events such as international climate summits and the COVID-19 pandemic period.
  • The dataset supports comprehensive analyses of public discourse around climate communication across time and geography.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could serve as a baseline for tracking changes in climate discourse if similar collections are assembled for later time periods.
  • Patterns identified in the five domains might inform experiments that test how different message formats alter engagement on other platforms.
  • Cross-referencing the metadata with external event timelines could quantify the strength of response to specific summits or policy announcements.
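
The cross-referencing idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not the paper's method: the function name `event_lift`, the window and baseline lengths, and the toy dates are all assumptions made for the example. It compares the mean number of posts per day around an event with the mean over the days just before it.

```python
from collections import Counter
from datetime import date, timedelta


def event_lift(post_dates, event_day, window=3, baseline=14):
    """Mean daily posts around an event divided by the mean over the days before it.

    Returns None when the baseline period contains no posts at all.
    """
    counts = Counter(post_dates)  # posts per calendar day; missing days count as 0
    window_days = [event_day + timedelta(days=d) for d in range(-window, window + 1)]
    baseline_days = [window_days[0] - timedelta(days=d) for d in range(1, baseline + 1)]
    window_mean = sum(counts[d] for d in window_days) / len(window_days)
    baseline_mean = sum(counts[d] for d in baseline_days) / len(baseline_days)
    if baseline_mean == 0:
        return None
    return window_mean / baseline_mean


# Toy data (hypothetical): 2 posts/day in the baseline, 6 posts/day near the event.
toy_event = date(2021, 11, 1)  # placeholder date, not a real event from the dataset
toy_posts = []
for offset in range(-17, -3):  # the 14 baseline days
    toy_posts += [toy_event + timedelta(days=offset)] * 2
for offset in range(-3, 4):    # the 7-day event window
    toy_posts += [toy_event + timedelta(days=offset)] * 6

print(event_lift(toy_posts, toy_event))  # 3.0
```

A ratio well above 1 would suggest a burst of posting around the event. With real data, the post dates would come from the dataset's timestamp field and the event dates from an external timeline.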

Load-bearing premise

The posts collected via CrowdTangle represent a sufficiently unbiased and comprehensive sample of public Facebook discourse on climate change.

What would settle it

A comparison showing that the dataset systematically excludes large segments of climate-related Facebook activity or overrepresents particular page categories relative to the full population of posts would undermine its claimed utility.
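
A comparison of this kind could take the form of a goodness-of-fit check on page categories. The sketch below is an illustration under stated assumptions: the category names, the counts, and the reference shares are hypothetical, and a reference distribution would have to come from some external benchmark, since the full population of posts is not directly observable. It also assumes both inputs cover the same set of categories.

```python
def chi_square_gof(observed_counts, reference_shares):
    """Pearson chi-square statistic of observed counts against reference shares."""
    total = sum(observed_counts.values())
    statistic = 0.0
    for category, share in reference_shares.items():
        expected = total * share
        statistic += (observed_counts.get(category, 0) - expected) ** 2 / expected
    return statistic


def representation_ratios(observed_counts, reference_shares):
    """Observed share divided by reference share; above 1 means overrepresented."""
    total = sum(observed_counts.values())
    return {
        category: (observed_counts.get(category, 0) / total) / share
        for category, share in reference_shares.items()
    }


# Hypothetical numbers, not taken from the paper.
toy_observed = {"news": 50, "ngo": 30, "government": 20}
toy_reference = {"news": 0.4, "ngo": 0.4, "government": 0.2}

stat = chi_square_gof(toy_observed, toy_reference)
print(round(stat, 3))  # 5.0, below 5.991 (the 5% critical value for 2 degrees of freedom)
print(representation_ratios(toy_observed, toy_reference))
```

In this toy case the statistic falls short of the 5% threshold, so the sample would not be flagged as skewed. A statistic far above the threshold, or ratios far from 1, would point to the kind of systematic over- or underrepresentation described above.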

Original abstract

We present ClimateChat-300K, a large-scale dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 through the CrowdTangle platform. The dataset contains 41 metadata features including post content, engagement metrics, and page attributes, covering material from more than 26,000 global pages. Each post includes rich contextual information such as language, timestamp, page category, and interaction counts, enabling comprehensive analyses of public discourse around climate communication. Using topic modeling and sentiment analysis, we identify ten main themes grouped into five domains: policy, activism, cooperation, science, and conservation. The results reveal that emotional tone, post format, and page identity strongly influence audience engagement, with visually rich and emotionally charged content receiving the highest levels of interaction. The dataset also demonstrates how online discussions evolved in response to major events such as international climate summits and the COVID-19 pandemic period. ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse. By releasing this dataset, we aim to support transparent, data-driven research and contribute to a deeper un-derstanding of how public engagement with climate issues develops across time, geography, and institutional contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ClimateChat-300K, a dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 via the CrowdTangle platform from over 26,000 global pages. It includes 41 metadata features and reports high-level analyses using topic modeling and sentiment analysis that identify ten themes grouped into five domains (policy, activism, cooperation, science, conservation), along with observations that emotional tone, post format, and page identity influence engagement and that discussions evolved around events such as climate summits and the COVID-19 period. The work positions the released dataset as an open resource supporting reproducible research on polarization, misinformation, and digital climate discourse.

Significance. If the collection methodology, sampling limitations, and analysis procedures were fully documented and validated, the dataset could provide a useful resource for NLP and social-science research on climate communication. In its current form the significance is reduced because the central claim of enabling reproducible studies on discourse dynamics rests on an uncharacterized sample whose representativeness is not demonstrated.

major comments (2)
  1. [Abstract] Abstract: The assertion that the posts enable 'comprehensive analyses of public discourse around climate communication' and represent 'diverse perspectives' is load-bearing for the reproducibility claim, yet the collection is restricted to CrowdTangle-tracked public pages; no evidence is supplied that the resulting 299k posts match the true distribution of climate-related Facebook activity across languages, regions, page types, or private content.
  2. [Abstract] Abstract (analysis description): The identification of ten themes and the claim that 'emotional tone, post format, and page identity strongly influence audience engagement' are presented without any description of topic-modeling parameters, sentiment-analysis validation, statistical tests, or error quantification, rendering the reported themes and engagement drivers impossible to reproduce or assess for robustness.
minor comments (2)
  1. [Abstract] Abstract: Typographical error 'un-derstanding' should read 'understanding'.
  2. [Abstract] Title vs. abstract: The title describes the dataset as 'Multi-Modal' but the abstract provides no information on non-text modalities (e.g., images or video) included in the 41 metadata features.
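
The second major comment asks for statistical tests behind the engagement claims. As one illustration of what such a test might look like, the sketch below runs a two-sided permutation test on the difference in mean engagement between two post formats. It is not the authors' procedure: the function name, the number of permutations, and the toy interaction counts are assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical interaction counts for two post formats.
toy_photo_posts = [100, 120, 130, 110, 140, 125]
toy_link_posts = [10, 12, 9, 15, 11, 13]

p_value = permutation_test(toy_photo_posts, toy_link_posts)
print(p_value)  # a small p-value is expected for these well-separated toy groups
```

A permutation test makes no distributional assumption, which suits heavy-tailed engagement counts. Reporting the p-value alongside an effect size would address the robustness concern more directly than a bare statement that a factor "strongly influences" engagement.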

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the ClimateChat-300K manuscript. We address each major comment below and will revise the abstract and add explicit limitations discussion to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that the posts enable 'comprehensive analyses of public discourse around climate communication' and represent 'diverse perspectives' is load-bearing for the reproducibility claim, yet the collection is restricted to CrowdTangle-tracked public pages; no evidence is supplied that the resulting 299k posts match the true distribution of climate-related Facebook activity across languages, regions, page types, or private content.

    Authors: We agree that CrowdTangle restricts the sample to public posts from tracked pages and excludes private content and untracked activity. The manuscript details the exact collection process (May 2020–May 2024, >26k pages, 41 metadata fields) and characterizes the sample by language, region, and page category. Because the complete population of climate-related Facebook activity is unobservable, we cannot supply evidence of exact distributional match. We will revise the abstract to qualify claims as applying to public discourse on tracked pages, add a dedicated limitations subsection on sampling biases, and emphasize that the released data and code enable reproduction of analyses on this specific corpus. revision: yes

  2. Referee: [Abstract] Abstract (analysis description): The identification of ten themes and the claim that 'emotional tone, post format, and page identity strongly influence audience engagement' are presented without any description of topic-modeling parameters, sentiment-analysis validation, statistical tests, or error quantification, rendering the reported themes and engagement drivers impossible to reproduce or assess for robustness.

    Authors: The abstract is a high-level summary; the full manuscript's Methods section specifies the topic-modeling procedure (LDA with 10 topics after standard preprocessing), sentiment-analysis pipeline, and statistical tests linking tone/format/page identity to engagement metrics. To address the concern, we will revise the abstract to include a concise methods reference and ensure all parameters, validation steps, and analysis code are released with the dataset for direct reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive dataset release with standard analysis

full rationale

The paper is a dataset release describing collection of 299k Facebook posts via CrowdTangle followed by application of standard topic modeling and sentiment analysis to identify themes. No equations, derivations, predictions, or first-principles results exist that could reduce to inputs by construction. No self-citations support load-bearing uniqueness claims or ansatzes. The central contribution is the open dataset itself, which stands independently of any circular logic in the described methods.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a dataset release paper, the contribution rests on standard assumptions about platform data access and NLP tool reliability rather than new parameters or entities.

axioms (1)
  • domain assumption: Topic modeling and sentiment analysis accurately capture themes and emotional tones in social media text.
    Invoked to identify ten themes grouped into five domains and to link emotional tone to engagement.
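
One common way to probe an assumption like this is to compare automated sentiment labels with a manually annotated sample and report chance-corrected agreement. The sketch below computes Cohen's kappa for two label sequences. It is an illustration only: the label names and the ten toy annotations are hypothetical, and nothing here is taken from the paper's own evaluation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both annotators labeled at random with their own label rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for ten posts.
toy_manual = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neu", "neu", "neu"]
toy_model = ["pos", "pos", "pos", "neu", "neg", "neg", "pos", "neu", "neu", "neu"]

print(round(cohens_kappa(toy_manual, toy_model), 2))  # 0.7
```

Here the raw agreement is 0.8, but kappa is lower because some agreement would occur by chance. A low kappa on a real annotated sample would weaken any conclusion that links emotional tone to engagement.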

pith-pipeline@v0.9.0 · 5779 in / 1264 out tokens · 30595 ms · 2026-05-25T04:53:22.057355+00:00 · methodology


Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

  1. [1]

    digitalization,

    Introduction. Climate change is widely regarded as one of the most significant global challenges of the 21st century (Sultana et al., 2024). It threatens the future of communities, ecosystems, and entire generations, demanding an extraordinary level of global cooperation. Scientists overwhelmingly agree that human activities are driving the acceleratin...

  2. [2]

    climate change

    Methods. 2.1. Data Collection. Data Source: We collected data from Facebook Pages using the CrowdTangle platform (Center, 2024), a public insights tool owned by Meta that provided programmatic access to public content from Pages, Groups, and verified accounts. CrowdTangle was widely adopted by journalists and researchers for transparency and social me...

  3. [3]

    Dataset Summary. The dataset comprises 299,329 climate-related Facebook posts with 41 features, collected from 26,731 unique pages between May 2020 and May

    Results. 3.1. Dataset Summary. The dataset comprises 299,329 climate-related Facebook posts with 41 features, collected from 26,731 unique pages between May 2020 and May

  4. [4]

    Likes” comprising 71.9% of reactions, followed by “Love

    It includes both textual and non-textual attributes, offering a comprehensive view of user engagement and posting behavior. This four-year temporal span enables longitudinal analysis of climate communication dynamics. On average, 205 posts per day were recorded, with notable activity peaks coinciding with global events and most prominently on April...

  5. [5]

    Positive

    Discussions. 4.1. Evaluation and Data Quality. The whole dataset was annotated by unsupervised VADER sentiment analysis model. To evaluate the performance of unsupervised model, we randomly selected 1000 samples and annotated it. Then, we compare the manually annotated data with the sampled data produced by VADER model. Table 6 shows the evaluation These...

  6. [6]

    Access to the dataset requires completion of a request form2

    Dataset Availability and License. The ClimateChat-300K dataset is publicly available via Zenodo 1. Access to the dataset requires completion of a request form2. The dataset is released strictly for research purposes. All materials are distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA...

  7. [7]

    Conclusion. ClimateChat-300K represents one of the most comprehensive open resources to date for analyzing global climate communication on Facebook. By compiling nearly three hundred thousand posts from more than twenty-six thousand pages, the dataset provides a unique opportunity to study the complex interplay between scientific information, public ...

  8. [8]

    Limitations. While the ClimateChat-300K dataset represents a substantial effort to document global online discourse on climate change, several limitations must be acknowledged. First, the dataset exclusively covers publicly available Facebook Pages, which may not fully reflect broader climate-related conversations occurring on other social media plat...

  9. [9]

    All data were obtained exclusively from publicly accessible Facebook Pages using the official CrowdTangle API prior to its retirement in 2024

    Ethics Statement. This work adheres to established ethical standards in computational social science and linguistic data research. All data were obtained exclusively from publicly accessible Facebook Pages using the official CrowdTangle API prior to its retirement in 2024. No private or restricted content was accessed, and no attempts were made to infer ...

  10. [10]

    greta effect

    References. Christian Bachmaier. 2007. A radial adaptation of the sugiyama framework for visualizing hierarchical information. IEEE Transactions on Visualization and Computer Graphics, 13(3):583–594. Ricardo Baeza-Yates. 2024. Introduction to responsible ai. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, page...

  11. [11]

    In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 31–38

    Unsupervised cross-lingual representation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 31–38. Alexandra Segerberg. 2017. Online and social media campaigns for climate change engagement. In Oxford research encyclopedia of climate science. Oxford University Press. G...