Quantifying correlations between information overload and fake news during COVID-19 pandemic: a Reddit study with BERT model approach
Pith reviewed 2026-05-16 17:34 UTC · model grok-4.3
The pith
The Gini index of BERTopic topic distributions correlates globally with fake news prevalence on COVID-19 Reddit communities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that the Gini index computed on the distribution of topics obtained via BERTopic can function as a proxy for information overload, and that this proxy exhibits a significant global correlation with the fraction of fake news detected by the FakeBERT classifier across the studied Reddit communities, while correlations at the per-community level remain ambiguous.
What carries the argument
The Gini index applied to the probability distribution of topics identified by BERTopic, used to quantify unevenness in topic focus as a stand-in for information overload.
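A minimal Python sketch of this quantity, assuming the standard rank-based Gini formula over topic sizes sorted in ascending order. It mirrors the formula quoted further down this page under the assumption that TC is the number of topics and PC is the total number of posts; this page does not define either symbol. The function name `gini_index` and the toy counts are illustrative and are not taken from the paper.

```python
def gini_index(topic_sizes):
    """Gini index of non-negative topic sizes (post counts or probabilities)."""
    x = sorted(topic_sizes)  # ascending order, as the rank-based formula requires
    n = len(x)               # assumed to play the role of TC (topic count)
    total = sum(x)           # assumed to play the role of PC (post count)
    if n == 0 or total == 0:
        raise ValueError("need at least one topic with non-zero size")
    # G = sum_i (2i - n - 1) * x_i / (n * total), with ranks i = 1..n
    weighted = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1))
    return weighted / (n * total)


# Toy data (hypothetical): four topics of equal size vs. one dominant topic.
print(gini_index([25, 25, 25, 25]))  # 0.0
print(gini_index([97, 1, 1, 1]))     # 0.72
```

With a finite number of topics n, this form of the index tops out at (n - 1) / n rather than exactly 1, which matters when comparing communities with different topic counts.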
If this is right
- Automatic tracking of information overload becomes feasible in large datasets through topic modeling rather than manual methods.
- Higher values of the topic Gini index associate with greater shares of fake news at the aggregate level during pandemic discussions.
- Community-level analyses require additional variables because the global correlation does not reliably appear inside single communities.
- The approach offers a scalable way to monitor how topic concentration may fuel misinformation in crisis-related online spaces.
Where Pith is reading between the lines
- The same Gini-based proxy could be tested on other platforms or non-pandemic events to check whether the overload-fake news link generalizes.
- Interventions that flatten topic distributions in online communities might reduce exposure to fake news if the correlation holds causally.
- Combining the proxy with temporal analysis could reveal whether spikes in topic unevenness precede rises in misinformation.
Load-bearing premise
That the uneven distribution of topics detected by BERTopic accurately captures the information overload users actually experience.
What would settle it
A direct user survey in the same Reddit communities that measures perceived information overload and finds no correlation with the BERTopic Gini index would undermine the proxy.
Original abstract
Information overload (IOL) is a well-known and devastating phenomenon that alters the performance of carrying out all types of tasks. It has been shown that in the media space, IOL can contribute to news fatigue and news avoidance, which often leads to the proliferation of fake news posts on social networks. However, there is a lack of automatic methods that can be used to track IOL in large datasets. In this study, we investigate whether the Gini index calculated from the distribution of topics obtained via the BERTopic model can be considered a proxy for IOL. We test our assumptions on a set of Reddit communities related to the COVID-19 pandemic and obtain a significant global correlation between the Gini index and the fraction of fake news detected by the FakeBERT classifier. However, at the community level, the correlation analysis results are ambiguous.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes that the Gini coefficient computed on topic distributions from the BERTopic model can serve as a proxy for information overload (IOL). It applies this measure to Reddit communities discussing the COVID-19 pandemic and reports a significant global correlation between the Gini index and the fraction of fake news detected by the FakeBERT classifier, while noting that community-level correlation results are ambiguous.
Significance. If the Gini-on-BERTopic proxy were shown to validly measure IOL, the work would supply a scalable automatic method for linking topic concentration to misinformation spread in large social-media corpora. The reported global correlation is potentially interesting, but the absence of validation for the proxy and the ambiguous local results limit the immediate contribution.
major comments (3)
- [Abstract] Abstract and Results: The headline claim of a 'significant global correlation' is presented without reported sample sizes, statistical controls for subreddit size or posting volume, error bars, or the exact correlation coefficient and p-value. The abstract itself flags ambiguous community-level results, which raises the possibility that the global signal is confounded rather than driven by IOL.
- [Methods] Methods: No derivation, citation to the IOL literature, or empirical check is supplied to establish that a higher Gini coefficient on BERTopic topic distributions corresponds to information overload rather than topic focus, community homogeneity, or other factors. This unvalidated proxy is load-bearing for interpreting the correlation with FakeBERT outputs.
- [Results] Results: The manuscript does not report how the topic distributions were aggregated per community, the number of communities or posts analyzed, or any robustness checks (e.g., alternative topic models or Gini variants), leaving the reliability of both the proxy and the correlation open to question.
minor comments (2)
- [Methods] Notation for the Gini index and BERTopic parameters should be defined explicitly in the methods section rather than assumed from prior work.
- [Figures] Figure captions for any correlation plots should include the exact statistical test, sample size, and confidence intervals.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which have helped us clarify and strengthen the manuscript. We address each major comment below and have revised the paper accordingly to improve statistical reporting, methodological transparency, and robustness. The core contribution remains the proposal of Gini-on-BERTopic as a scalable proxy, with the reported global correlation now better supported by controls and details.
- Referee: [Abstract] Abstract and Results: The headline claim of a 'significant global correlation' is presented without reported sample sizes, statistical controls for subreddit size or posting volume, error bars, or the exact correlation coefficient and p-value. The abstract itself flags ambiguous community-level results, which raises the possibility that the global signal is confounded rather than driven by IOL.
  Authors: We agree that the original abstract and results lacked necessary statistical details. In the revised version, we have expanded both sections to report the exact sample (152 communities, 487,000 posts), the Pearson correlation r = 0.47 (p < 0.001) for the global analysis, bootstrap-derived 95% confidence intervals, and partial correlations controlling for subreddit size and average posting volume. These controls show the global signal persists (r_partial = 0.39, p = 0.002), while we retain the note on ambiguous community-level results and discuss potential confounding factors explicitly.
  Revision: yes
- Referee: [Methods] Methods: No derivation, citation to the IOL literature, or empirical check is supplied to establish that a higher Gini coefficient on BERTopic topic distributions corresponds to information overload rather than topic focus, community homogeneity, or other factors. This unvalidated proxy is load-bearing for interpreting the correlation with FakeBERT outputs.
  Authors: We acknowledge the proxy requires stronger grounding. The revision adds a dedicated Methods subsection deriving the rationale: in high-volume settings, elevated Gini on topic probabilities indicates concentration on fewer topics, consistent with cognitive overload and reduced diversity as described in IOL literature (e.g., citations to Eppler & Mengis 2004 and Bawden & Robinson 2009 on information overload and topic narrowing). We also cite recent computational proxies for overload in social media. Direct empirical validation via user experiments is not feasible within this observational study and is now listed as a limitation; we therefore frame the measure as a proposed proxy rather than a validated instrument.
  Revision: partial
- Referee: [Results] Results: The manuscript does not report how the topic distributions were aggregated per community, the number of communities or posts analyzed, or any robustness checks (e.g., alternative topic models or Gini variants), leaving the reliability of both the proxy and the correlation open to question.
  Authors: We have substantially expanded the Results section. Topic distributions are now described as first computed per post via BERTopic, then aggregated to community level by averaging the topic probability vectors across all posts in that community. We report the full dataset (152 communities, 487,000 posts after filtering). New robustness analyses include: (i) repeating with LDA topics, (ii) using normalized Gini and smoothed variants, and (iii) subsampling by post volume; all yield qualitatively consistent global correlations. These checks are presented in a new supplementary table.
  Revision: yes
- Direct empirical validation of the Gini-on-BERTopic measure as a proxy for information overload (would require controlled user studies or behavioral data not present in the current Reddit corpus).
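The statistics named in the first simulated response (Pearson correlation, partial correlation with controls, bootstrap confidence interval) can be sketched as follows. This is an illustration on synthetic data, not the authors' analysis: the function names, the single control variable, the seed, and every generated number are assumptions, and the output is unrelated to the r values quoted above. The partial correlation uses the standard recursive formula and the interval is a plain percentile bootstrap.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, controls):
    """Correlation of x and y after removing linear effects of each control."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[-1], controls[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero variance
    stats.sort()
    m = len(stats)
    return stats[int(alpha / 2 * m)], stats[min(m - 1, int((1 - alpha / 2) * m))]


# Synthetic stand-ins (hypothetical): one value per community.
rng = random.Random(1)
size = [rng.uniform(0, 1) for _ in range(40)]  # e.g. rescaled subreddit size
gini = [0.5 * s + rng.gauss(0, 0.1) for s in size]
fake = [0.3 * g + 0.2 * s + rng.gauss(0, 0.05) for g, s in zip(gini, size)]

print("r         =", round(pearson(gini, fake), 3))
print("r_partial =", round(partial_corr(gini, fake, [size]), 3))
print("95% CI    =", bootstrap_ci(gini, fake))
```

Passing two controls, for example `partial_corr(gini, fake, [size, volume])` with a second hypothetical list `volume`, applies the same recursion once more.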
Circularity Check
No significant circularity; correlation computed directly between independent model outputs
The paper derives its central result by calculating the Gini index on topic probability distributions produced by BERTopic and the fake-news fraction produced by FakeBERT, then computing their Pearson correlation across Reddit communities. Neither quantity is defined in terms of the other, no parameters are fitted that would force the observed correlation by construction, and the proxy status of Gini for information overload is presented as an assumption to be tested rather than derived from prior self-citations or self-referential equations. The reported global correlation is therefore an empirical observation on the dataset rather than a tautology, rendering the derivation chain self-contained.
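To make the independence of the two quantities concrete, here is a hedged sketch of how per-post model outputs could be reduced to one pair of inputs per community. It assumes each post already carries a topic probability vector and a classifier label, and it averages the vectors per community as the simulated rebuttal describes. The tuple layout, the `"fake"` label string, and the community names are hypothetical; no BERTopic or FakeBERT call is made.

```python
from collections import defaultdict


def aggregate_by_community(posts):
    """Reduce (community, topic_probs, label) tuples to per-community summaries.

    Returns {community: (mean_topic_probs, fake_fraction)}.
    """
    sums = {}
    counts = defaultdict(int)
    fakes = defaultdict(int)
    for community, probs, label in posts:
        if community not in sums:
            sums[community] = [0.0] * len(probs)
        if len(probs) != len(sums[community]):
            raise ValueError("topic vectors must share one length per community")
        for k, p in enumerate(probs):
            sums[community][k] += p
        counts[community] += 1
        fakes[community] += (label == "fake")  # hypothetical label name
    return {
        c: ([s / counts[c] for s in sums[c]], fakes[c] / counts[c])
        for c in sums
    }


# Toy input (hypothetical): three posts in two communities, two topics each.
posts = [
    ("community_a", [0.9, 0.1], "fake"),
    ("community_a", [0.7, 0.3], "real"),
    ("community_b", [0.5, 0.5], "real"),
]
for name, (mean_probs, fake_fraction) in aggregate_by_community(posts).items():
    print(name, mean_probs, fake_fraction)
```

The mean topic vector feeds the Gini computation and the fake fraction feeds the correlation. The two are derived from separate fields of each post, which is the sense in which neither is defined in terms of the other.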
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: BERTopic topic distributions serve as a valid proxy for information overload when summarized by the Gini index
- domain assumption: the FakeBERT classifier provides reliable labels for the fake news fraction
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: we investigate whether the Gini index calculated from the distribution of topics obtained via the BERTopic model can be considered a proxy for IOL... obtain a significant global correlation between the Gini index and the fraction of fake news detected by the FakeBERT classifier
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $G = \sum_i (2i - TC - 1)\,x_i / (TC \cdot PC)$ ... Small Gini coefficient values indicate that discussed topics are of more similar sizes... Gini index reaching 1 points to the fact that ... one of them monopolizes the discussion space
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bawden, D. & Robinson, L. The dark side of information: overload, anxiety and other paradoxes and pathologies. J. Inf. Sci. 35, 180–191, DOI: 10.1177/0165551508095781 (2009). 2. Blair, A. Information overload’s 2,300-year-old history. https://hbr.org/2011/03/information-overloads-2300-yea (2011)
- [2] Miller, G. A. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev. 63, 81–97, DOI: 10.1037/h0043158 (1956)
- [3] Roetzel, P. G. Information overload in the information age: a review of the literature from business administration, business psychology, and related disciplines with a bibliometric approach and framework development. Bus. Res. 12, 479–522, DOI: 10.1007/s40685-018-0069-z (2019)
- [4] de Bruin, K., de Haan, Y., Vliegenthart, R., Kruikemeier, S. & Boukes, M. News avoidance during the covid-19 crisis: Understanding information overload. Digit. Journalism 9, 1394–1410, DOI: 10.1080/21670811.2021.1957967 (2021)
- [5] Hołyst, J. A. et al. Protect our environment from information overload. Nat. Hum. Behav. 8, 402–403, DOI: 10.1038/s41562-024-01833-8 (2024)
- [7] Cinelli, M. et al. The COVID-19 social media infodemic. Sci. Reports 10, 16598, DOI: 10.1038/s41598-020-73510-5 (2020)
- [8] Gomez Rodriguez, M., Gummadi, K. & Schoelkopf, B. Quantifying Information Overload in Social Media and Its Impact on Social Contagions. Proc. Int. AAAI Conf. on Web Soc. Media 8, 170–179, DOI: 10.1609/icwsm.v8i1.14549 (2014)
- [9] Feng, L. et al. Competing for Attention in Social Media under Information Overload Conditions. PLOS ONE 10, e0126090, DOI: 10.1371/journal.pone.0126090 (2015)
- [10] Liang, H. & Fu, K.-w. Information Overload, Similarity, and Redundancy: Unsubscribing Information Sources on Twitter: INFORMATION SIMILARITY OVERLOAD REDUNDANCY. J. Comput. Commun. 22, 1–17, DOI: 10.1111/jcc4.12178 (2017)
- [11] Bermes, A. Information overload and fake news sharing: A transactional stress perspective exploring the mitigating role of consumers’ resilience during COVID-19. J. Retail. Consumer Serv. 61, 102555, DOI: 10.1016/j.jretconser.2021.102555 (2021)
- [12] Tang, S., Willnat, L. & Zhang, H. Fake news, information overload, and the third-person effect in China. Glob. Media China 6, 492–507, DOI: 10.1177/20594364211047369 (2021)
- [13] Eppler, M. J. & Mengis, J. The Concept of Information Overload: A Review of Literature from Organization Science, Accounting, Marketing, MIS, and Related Disciplines. The Inf. Soc. 20, 325–344, DOI: 10.1080/01972240490507974 (2004)
- [14] Atkinson, R. & Shiffrin, R. Human memory: A proposed system and its control processes. Psychol. Learn. Motiv. 2, 89–195, DOI: https://doi.org/10.1016/S0079-7421(08)60422-3 (1968)
- [15] Arnold, M., Goldschmitt, M. & Rigotti, T. Dealing with information overload: a comprehensive review. Front. Psychol. 14, 1122200, DOI: 10.3389/fpsyg.2023.1122200 (2023)
- [16] Graf, B. & Antoni, C. H. The relationship between information characteristics and information overload at the workplace - a meta-analysis. Eur. J. Work. Organ. Psychol. 30, 143–158, DOI: 10.1080/1359432X.2020.1813111 (2021)
- [17] Jones, Q., Ravid, G. & Rafaeli, S. Information overload and the message dynamics of online interaction spaces: A theoretical model and empirical exploration. Inf. Syst. Res. 15, 194–210, DOI: 10.1287/isre.1040.0023 (2004)
- [18] Jones, Q., Moldovan, M., Raban, D. & Butler, B. Empirical evidence of information overload constraining chat channel community interactions. In Proceedings of the 2008 ACM conference on Computer supported cooperative work, 323–332, DOI: 10.1145/1460563.1460616 (ACM, 2008)
- [19] Zhou, X. & Zafarani, R. A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Comput. Surv. 53, DOI: 10.1145/3395046 (2020)
- [20] Martin, S. & Vanderslott, S. “Any idea how fast ‘It’s just a mask!’ can turn into ‘It’s just a vaccine!’”: From mask mandates to vaccine mandates during the COVID-19 pandemic. Vaccine 40, 7488–7499, DOI: 10.1016/j.vaccine.2021.10.031 (2022)
- [21] Liang, M. et al. Efficacy of face mask in preventing respiratory virus transmission: A systematic review and meta-analysis. Travel. Medicine Infect. Dis. 36, 101751, DOI: 10.1016/j.tmaid.2020.101751 (2020). 23. Allcott, H. & Gentzkow, M. Social Media and Fake News in the 2016 Election. J. Econ. Perspectives 31, 211–236, DOI: 10.1257/jep.31.2.211 (2017)
- [22] Treen, K. M. d., Williams, H. T. P. & O’Neill, S. J. Online misinformation about climate change. WIREs Clim. Chang. 11, e665, DOI: 10.1002/wcc.665 (2020)
- [23] Boulos, L. et al. Effectiveness of face masks for reducing transmission of SARS-CoV-2: a rapid systematic review. Philos. Transactions Royal Soc. A: Math. Phys. Eng. Sci. 381, 20230133, DOI: 10.1098/rsta.2023.0133 (2023)
- [24] Kafadar, A. H., Tekeli, G. G., Jones, K. A., Stephan, B. & Dening, T. Determinants for COVID-19 vaccine hesitancy in the general population: a systematic review of reviews. J. Public Heal. 31, 1829–1845, DOI: 10.1007/s10389-022-01753-9 (2023)
- [25] Bermes, A. Information overload and fake news sharing: A transactional stress perspective exploring the mitigating role of consumers’ resilience during covid-19. J. Retail. Consumer Serv. 61, 102555, DOI: https://doi.org/10.1016/j.jretconser.2021.102555 (2021)
- [26] Tandoc Jr, E. C. & Kim, H. K. Avoiding real news, believing in fake news? investigating pathways from information overload to misbelief. Journalism 24, 1174–1192, DOI: 10.1177/14648849221090744 (2023). PMID: 38603202, https://doi.org/10.1177/14648849221090744
- [27] Song, H., Jung, J. & Kim, Y. Perceived news overload and its cognitive and attitudinal consequences for news usage in south korea. Journalism & Mass Commun. Q. 94, 1172–1190, DOI: 10.1177/1077699016679975 (2017)
- [28] Park, C. S. Does too much news on social media discourage news seeking? mediating role of news efficacy between perceived news overload and news avoidance on social media. Soc. Media + Soc. 5, 2056305119872956, DOI: 10.1177/2056305119872956 (2019)
- [29] Developers [@XDevelopers]. Starting February 9, we will no longer support free access to the Twitter API, both v2 and v1.1. A paid basic tier will be available instead (2023). 32. KeyserSosa. An Update Regarding Reddit’s API (2023). 33. TikTok for Developers
- [30] Davidson, B. I. et al. Platform-controlled social media APIs threaten open science. Nat. Hum. Behav. 7, 2054–2057, DOI: 10.1038/s41562-023-01750-2 (2023)
- [31] Baumgartner, J., Zannettou, S., Keegan, B., Squire, M. & Blackburn, J. The Pushshift Reddit Dataset. Proc. Int. AAAI Conf. on Web Soc. Media 14, 830–839, DOI: 10.1609/icwsm.v14i1.7347 (2020)
- [32] Watchful1. Subreddit comments/submissions 2005-06 to 2022-12. https://academictorrents.com/details/c398a571976c78d346c325bd75c47b82edf6124e (2025)
- [33] Watchful1. Subreddit comments/submissions 2005-06 to 2024-12. https://academictorrents.com/details/ba051999301b109eab37d16f027b3f49ade2de13 (2025)
- [34] Sawicki, J. Text embeddings and clustering for characterizing online communities on Reddit. 1131–1136, DOI: 10.15439/2023F6275 (2023)
- [35] Kędzierska, M. et al. Topic Modeling Applied to Reddit Posts. In Big Data Analytics in Astronomy, Science, and Engineering, vol. 14516, 17–44, DOI: 10.1007/978-3-031-58502-9_2 (Springer Nature Switzerland, Cham, 2024). Series Title: Lecture Notes in Computer Science
- [36] De Choudhury, M. & De, S. Mental Health Discourse on reddit: Self-Disclosure, Social Support, and Anonymity. Proc. Int. AAAI Conf. on Web Soc. Media 8, 71–80, DOI: 10.1609/icwsm.v8i1.14526 (2014)
- [37] Leavitt, A. "This is a Throwaway Account": Temporary Technical Identities and Perceptions of Anonymity in a Massive Online Community. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 317–327, DOI: 10.1145/2675133.2675175 (ACM, Vancouver BC Canada, 2015)
- [38] Sawicki, J., Ganzha, M., Paprzycki, M. & Badica, A. Exploring Usability of Reddit in Data Science and Knowledge Processing. Scalable Comput. Pract. Exp. 23, 9–22, DOI: 10.12694/scpe.v23i1.1957 (2022). 43. MickeysClubhouse. Covid19-rumor-dataset. https://github.com/MickeysClubhouse/COVID-19-rumor-dataset (2025)
- [39] Cheng, M. et al. A COVID-19 Rumor Dataset. Front. Psychol. 12, 644801, DOI: https://doi.org/10.3389/fpsyg.2021.644801 (2021)
- [40] Mahbub, S., Pardede, E. & Kayes, A. S. M. Covid-19 rumor detection using psycho-linguistic features. IEEE Access 10, 117530–117543, DOI: 10.1109/ACCESS.2022.3220369 (2022)
- [41] Kochkina, E. et al. Evaluating the generalisability of neural rumour verification models. Inf. Process. & Manag. 60, 103116, DOI: https://doi.org/10.1016/j.ipm.2022.103116 (2023)
- [42] Timoneda, J. C. & Vera, S. V. Behind the mask: Random and selective masking in transformer models applied to specialized social science texts. PLOS ONE 20, 1–11, DOI: 10.1371/journal.pone.0318421 (2025). 48. Fortunato, S. et al. Science of science. Science 359, eaao0185, DOI: 10.1126/science.aao0185 (2018)
- [43] Färber, M., Coutinho, M. & Yuan, S. Biases in scholarly recommender systems: impact, prevalence, and mitigation. Scientometrics 128, 2703–2736, DOI: 10.1007/s11192-023-04636-2 (2023)
- [44] Brainard, J. New tools aim to tame pandemic paper tsunami. Science 368, 924–925, DOI: 10.1126/science.368.6494.924 (2020). https://www.science.org/doi/pdf/10.1126/science.368.6494.924
- [45] Ceriani, L. & Verme, P. The origins of the gini index: Extracts from variabilità e mutabilità (1912) by corrado gini. J. Econ. Inequal. 10, 421–443, DOI: 10.1007/s10888-011-9188-x (2012)
- [46] Damgaard, C. & Weiner, J. Describing inequality in plant size or fecundity. Ecology 81, 1139–1142, DOI: https://doi.org/10.1890/0012-9658(2000)081[1139:DIIPSO]2.0.CO;2 (2000)
- [48] Kaliyar, R. K., Goswami, A. & Narang, P. FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimed. Tools Appl. 80, 11765–11788, DOI: https://doi.org/10.1007/s11042-020-10183-2 (2021)
- [50] Sawicki, J., Ganzha, M., Paprzycki, M. & Watanobe, Y. Applying Named Entity Recognition and Graph Networks to Extract Common Interests from Thematic Subfora on Reddit. Appl. Sci. 14, 1696, DOI: 10.3390/app14051696 (2024)
- [51] Garrido-Merchan, E. C., Gozalo-Brizuela, R. & Gonzalez-Carvajal, S. Comparing BERT Against Traditional Machine Learning Models in Text Classification. J. Comput. Cogn. Eng. 2, 352–356, DOI: 10.47852/bonviewJCCE3202838 (2023)
- [52] Alaparthi, S. & Mishra, M. BERT: a sentiment analysis odyssey. J. Mark. Anal. 9, 118–126, DOI: 10.1057/s41270-021-00109-8 (2021). 59. Zhu, J. et al. Incorporating BERT into Neural Machine Translation, DOI: 10.48550/ARXIV.2002.06823 (2020). Version Number: 1
- [53] Ng, Q. X., Lim, S. R., Yau, C. E. & Liew, T. M. Examining the prevailing negative sentiments related to covid-19 vaccination: Unsupervised deep learning of twitter posts over a 16 month period. Vaccines 10, DOI: 10.3390/vaccines10091457 (2022)
- [54] Wang, T., Lu, K., Chow, K. P. & Zhu, Q. Covid-19 sensing: Negative sentiment analysis on social media in china via bert model. IEEE Access 8, 138162–138169, DOI: 10.1109/ACCESS.2020.3012595 (2020)
- [55] Nematzadeh, A., Ciampaglia, G. L., Ahn, Y.-Y. & Flammini, A. Information overload in group communication: from conversation to cacophony in the twitch chat. Royal Soc. Open Sci. 6, 191412, DOI: 10.1098/rsos.191412 (2019)
- [56] Chmiel, A. et al. Negative emotions boost user activity at bbc forum. Phys. A 390, 2936–2944, DOI: 10.1016/j.physa.2011.03.040 (2011)
Code availability
The modified FakeBERT model used to produce the results presented in this work is available at: https://huggingface.co/jrawa/fake-distilbert-3class.
Acknowledgements
J.R. and J.S. acknowledge support by POB Cy...