Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Jana Lasser; Kirill Solovev; Ruggero Marino Lazzaroni

arxiv: 2605.18337 · v1 · pith:AZBDH54Rnew · submitted 2026-05-18 · 💻 cs.CL

Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3

keywords: news corpus, Common Crawl, information retrieval, language detection, geographic attribution, suffix array, large-scale datasets, computational social science

The pith

Infini-News gives researchers fast searchable access to 1.35 billion processed Common Crawl news articles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large news collections support studies of media, politics, and language trends, yet raw Common Crawl data demands heavy storage and processing. The paper builds Infini-News by extracting and cleaning text from the full CC-News archive since 2016, parsing metadata, and adding language tags from three classifiers plus country-of-origin labels for 83.4 percent of articles. It then constructs Infini-gram suffix-array indexes that return matches for any text pattern in sub-second time. This combination removes the need for individual teams to reprocess terabytes of data for longitudinal or cross-national research.

Core claim

We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. We extract, clean the text, and parse the structured metadata of over 1.35B articles. We enrich the corpus with language detection using three frontier language classifiers and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. We construct Infini-gram indexes that let researchers search the full archive for arbitrary text patterns in sub-second time.

What carries the argument

Infini-gram suffix-array indexes that deliver sub-second query times across the full 1.35 billion article corpus.
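To make the mechanism concrete, here is a minimal character-level sketch of suffix-array pattern search. It is a toy illustration only, not the Infini-gram implementation: the function names are hypothetical, the construction is naive, and everything lives in memory, whereas a production engine would work over tokenized text with on-disk arrays.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return start positions of all suffixes of `text` in lexicographic order.

    Naive construction by sorting full suffixes; suitable for toy inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text: str, sa: list[int], pattern: str, upper: bool) -> int:
    """Binary search for the first (or one past the last) suffix starting with `pattern`."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return sorted start offsets of every occurrence of `pattern` in `text`.

    Because suffixes are sorted, all suffixes that begin with `pattern`
    form one contiguous run, located with two binary searches.
    """
    lo = _bound(text, sa, pattern, upper=False)
    hi = _bound(text, sa, pattern, upper=True)
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    corpus = "the news cycle and the news desk"
    index = build_suffix_array(corpus)
    print(find_occurrences(corpus, index, "news"))
```

Each lookup costs a logarithmic number of string comparisons in the corpus length, which is why this family of indexes can answer exact-match queries quickly once the array has been built.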

If this is right

Longitudinal studies of news coverage over years become feasible without each team building its own pipeline.
Cross-national media research can draw on country labels for most articles without additional attribution work.
NLP experiments gain a standardized, language-tagged news dataset at full Common Crawl scale.
Arbitrary phrase or pattern searches replace slow full-corpus scans for exploratory analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The geographic tags could support new analyses of how coverage of the same event differs by country of origin.
Similar indexing methods might be applied to other large web archives to create comparable research resources.
Integration with existing NLP tools could let users run entity or sentiment queries directly against the indexed corpus.

Load-bearing premise

The extraction, cleaning, language detection, and geographic attribution steps produce data accurate enough for downstream research use.

What would settle it

A test that runs a set of known news-event queries on the live index, checks whether returned articles match independent ground-truth coverage and assigned countries, and confirms measured query latencies stay under one second.

Figures

Figures reproduced from arXiv: 2605.18337 by Jana Lasser, Kirill Solovev, Ruggero Marino Lazzaroni.

**Figure 1.** INFINI-NEWS corpus volume over time. Top panel (orange): articles per month, raw monthly count overlaid with a 3-month rolling mean. Bottom panel (blue): distinct hostnames seen per month. The drop in 2023 and after may be connected to the 2023 wave of CCBot disallow rules in news-publisher robots.txt files (Longpre et al. 2024), part of a broader AI-crawler blocking trend documented by Fletcher (2024). …

**Figure 2.** Cumulative monthly article counts from August 2016 through April 2026 for the global total and three high-resource languages, with both series clipped to the same window. Across the global total INFINI-NEWS contains roughly 1.36 B articles, against Factiva’s 1.12 B; the lead is larger for English (507 M vs. 388 M) and Spanish (135 M vs. 118 M). For Russian, both indexes reach ∼ 88 M articles, with I…

**Figure 3.** Per-month barbell plot of crawled (orange, …

Original abstract

Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
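The abstract does not say how the three language tags are reconciled, or whether they are reconciled at all. As a purely hypothetical sketch of how a downstream user might work with three per-article tags, the snippet below keeps a label only when at least two classifiers agree. The dictionary keys and the two-vote threshold are assumptions made for illustration, not fields or rules taken from the paper.

```python
from collections import Counter
from typing import Optional


def consensus_language(tags: dict[str, Optional[str]]) -> tuple[Optional[str], int]:
    """Return (label, votes) when at least two classifiers agree, else (None, votes).

    `tags` maps a classifier name to its predicted language code; missing or
    empty predictions are ignored. The keys used below are hypothetical.
    """
    votes = Counter(tag for tag in tags.values() if tag)
    if not votes:
        return None, 0
    label, count = votes.most_common(1)[0]
    if count < 2:
        # No two classifiers agree, so no consensus label is assigned.
        return None, count
    return label, count


if __name__ == "__main__":
    article_tags = {"glotlid": "en", "lingua": "en", "commonlingua": "de"}
    print(consensus_language(article_tags))
```

A stricter analysis could require unanimity, while a more permissive one could fall back to a single preferred classifier; which rule is appropriate depends on the languages under study.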

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents Infini-News, a retrieval toolkit and index for the full CC-News archive (August 2016 to latest snapshot) comprising over 1.35 billion articles. The authors detail extraction and cleaning of text and structured metadata, enrichment via language detection with GlotLID, lingua, and CommonLingua plus multi-source geographic attribution resolving a country for 83.4% of articles across 222 countries, and construction of Infini-gram suffix-array indexes claimed to support arbitrary-pattern searches in sub-second time.

Significance. If the data-processing pipeline produces sufficiently accurate output and the index performance claims are substantiated, the resource would meaningfully lower barriers to large-scale longitudinal and cross-national news analysis in computational social science and NLP. The scale of the processed corpus and the multi-classifier enrichment approach represent concrete contributions to open data infrastructure.

major comments (1)

[Abstract] The claim that Infini-gram suffix-array indexes enable 'sub-second time' searches for arbitrary text patterns across the full 1.35 billion article corpus is presented without any reported query-latency measurements, index-construction cost, memory footprint, or hardware configuration. This is load-bearing for the central claim, because suffix-array performance at terabyte scale is sensitive to implementation choices (compressed vs. plain, disk vs. RAM, single-node vs. distributed) that remain unquantified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential value of Infini-News for large-scale news analysis. We address the single major comment below and will incorporate the requested details in the revision.

Referee: [Abstract] The claim that Infini-gram suffix-array indexes enable 'sub-second time' searches for arbitrary text patterns across the full 1.35 billion article corpus is presented without any reported query-latency measurements, index-construction cost, memory footprint, or hardware configuration. This is load-bearing for the central claim, because suffix-array performance at terabyte scale is sensitive to implementation choices (compressed vs. plain, disk vs. RAM, single-node vs. distributed) that remain unquantified.

Authors: We agree that the performance claim requires empirical support. The manuscript currently states the sub-second capability based on the theoretical efficiency of suffix arrays for exact pattern matching and on internal testing during index construction, but does not report quantitative benchmarks. In the revised manuscript we will add a new subsection (or appendix) that reports: (1) average and 95th-percentile query latencies for patterns of varying lengths on the full 1.35 B article index, (2) wall-clock time and compute cost for index construction, (3) peak memory footprint and storage size of the index, and (4) the exact hardware configuration (CPU, RAM, storage type, and whether the index resides in RAM or on disk). These measurements will be obtained by re-running the indexing and query pipeline on the production hardware and will be presented with clear methodology so readers can assess the claims.

Revision: yes
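The mean and 95th-percentile latencies mentioned in the response could be gathered with a small harness along the following lines. This is a sketch under stated assumptions: `search` stands in for whatever query function the real index exposes (hypothetical here), the percentile uses the nearest-rank definition, and no measured numbers are implied.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_latencies(search: Callable[[str], object],
                      patterns: Iterable[str],
                      repeats: int = 5) -> list[float]:
    """Time `search(pattern)` `repeats` times per pattern; return seconds per call."""
    latencies = []
    for pattern in patterns:
        for _ in range(repeats):
            start = time.perf_counter()
            search(pattern)
            latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: list[float]) -> dict[str, float]:
    """Return sample size, mean, and nearest-rank 95th percentile of the latencies."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    n = len(ordered)
    rank = max(1, (95 * n + 99) // 100)  # ceil(0.95 * n) using integer arithmetic
    return {"n": n, "mean_s": statistics.fmean(ordered), "p95_s": ordered[rank - 1]}


if __name__ == "__main__":
    # Stand-in search over a tiny in-memory string; a real run would query the index.
    corpus = "the news cycle and the news desk " * 1000
    stats = summarize(measure_latencies(corpus.count, ["news", "desk", "cycle and"]))
    print(stats)
```

Reporting the hardware, the cache state (cold versus warm), and the pattern lengths alongside these numbers is what would let readers compare them against the sub-second claim.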

Circularity Check

0 steps flagged

No circularity: paper is a systems/data contribution with no derivations or predictions

The paper describes extraction, cleaning, language detection, geographic attribution, and construction of Infini-gram suffix-array indexes for the CC-News corpus. No mathematical derivations, equations, predictions, fitted parameters, or first-principles results are present that could reduce to inputs by construction. Performance claims about sub-second queries are engineering assertions about the built indexes rather than derived quantities. The work is self-contained as a toolkit presentation without load-bearing self-citations, ansatzes, or uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an engineering and data-curation project rather than a theoretical derivation. No free parameters, new entities, or non-standard axioms are introduced in the abstract.

axioms (1)

Domain assumption: CC-News snapshots from Common Crawl contain extractable news articles with usable text and metadata.
The entire pipeline presupposes that the raw Common Crawl data is sufficiently structured and representative to support the claimed cleaning and enrichment steps.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

[1] Henrich, Joseph; Heine, Steven J.; Norenzayan, Ara. The Weirdest People in the World? Behavioral and Brain Sciences, June.
[2] Hovy, Dirk; Prabhumoye, Shrimai. Five Sources of Bias in Natural Language Processing. Language and Linguistics Compass, August.
[3] Common Crawl News Dataset.
[4] Gilbert, Stacy; Watkins, Alexander. A Comparison of News Databases. Newspaper Research Journal, September. doi:10.1177/0739532920950039
[5] Factiva.
[6] Le Pochat, Victor; Van Goethem, Tom; Tajalizadehkhoob, Samaneh; Korczy… Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation. Proceedings of the 26th Annual Network and Distributed System Security Symposium, 2019. doi:10.14722/ndss.2019.23386
[7] LexisNexis.
[8] Metzler, Katie; Kim, David A.; Allum, Nick; Denman, Angella. September. doi:10.4135/wp160926
[9] Costa, Miguel; Masan… The Past Web. doi:10.1007/978-3-030-63291-5_21
[10] Weber, Matthew S.; Napoli, Philip M. Journalism History, Web Archives, and New Methods for Understanding the Evolution of Digital Journalism. Digital Journalism, September.
[11] Peris, Antoine; Meijers, Evert; van Ham, Maarten. Information Diffusion between Dutch Cities: Revisiting Zipf and Pred Using a Computational Social Science Approach. Computers, Environment and Urban Systems, January.
[12] Field, Anjalie; Kliger, Doron; Wintner, Shuly; Pan, Jennifer; Jurafsky, Dan; Tsvetkov, Yulia. Framing and Agenda-Setting in Russian News: A Computational Analysis of Intricate Political Strategies.
[13] Liu, Yujian; Zhang, Xinliang; Zou, Kaijian; Huang, Ruihong; Beauchamp, Nicholas; Wang, Lu. All Things Considered: Detecting Partisan Events from News Media with Cross-Article Comparison.
[14] Pecile, Giulio; Di Marco, Niccol… Mapping the Global Election Landscape on Social Media in 2024. PLOS ONE, February.
[15] Wang, Hanjing; Li, Yupeng; Ning, Xuan. News Coverage of the COVID-19 Pandemic on Social Media and the Public. Journal of Medical Internet Research. doi:10.2196/48491
[16] Pierri, Francesco; Tocchetti, Andrea; Corti, Lorenzo; Di Giovanni, Marco; Pavanetto, Silvio; Brambilla, Marco; Ceri, Stefano. VaccinItaly: Monitoring Italian Conversations around Vaccines on Twitter and Facebook.
[17] Chen, Emily; Ferrara, Emilio. Tweets in Time of Conflict: A Public Dataset Tracking the Twitter Discourse on the War between Ukraine and Russia.
[18] Hershcovich, Daniel; Frank, Stella; Lent, Heather; de Lhoneux, Miryam; Abdou, Mostafa; Brandl, Stephanie; Bugliarello, Emanuele; Cabello Piqueras, Laura; Chalkidis, Ilias; Cui, Ruixiang; Fierro, Constanza; Margatina, Katerina; Rust, Phillip; S… Challenges and Strategies in Cross-Cultural NLP. Proceedings of t...
[19] Karstens, Mikaela; Soules, Michael J.; Dietrich, Nick. On the Replicability of Data Collection Using Online News Databases. PS: Political Science & Politics, January.
[20] Hoffmann, Matthias; Santos, Felipe G.; Neumayer, Christina; Mercea, Dan. Lifting the Veil on the Use of Big Data News Repositories: A Documentation and Critical Discussion of a Protest Event Analysis. Communication Methods and Measures, September.
[21] Liu, Jiacheng; Min, Sewon; Zettlemoyer, Luke; Choi, Yejin; Hajishirzi, Hannaneh. Infini-Gram: Scaling Unbounded N-Gram Language Models to a Trillion Tokens.
[22] Xu, Hao; Liu, Jiacheng; Choi, Yejin; Smith, Noah A.; Hajishirzi, Hannaneh. Infini-Gram Mini: Exact n-gram Search at the Internet Scale with FM-Index. 2025.
[23] Roberts, Hal; Bhargava, Rahul; Valiukas, Linas; Jen, Dennis; Malik, Momin M.; Bishop, Cindy Sherman; Ndulue, Emily B.; Dave, Aashka; Clark, Justin; Etling, Bruce; Faris, Robert; Shah, Anushka; Rubinovitz, Jasmin; Hope, Alexis; D… Media Cloud: Massive Open Source Collection of Global News on the Open Web. P...
[24] Gebru, Timnit; Morgenstern, Jamie; Vecchione, Briana; Vaughan, Jennifer Wortman; Wallach, Hanna; … Datasheets for Datasets. Communications of the ACM, November.
[25] Hamborg, Felix; Meuschke, Norman; Breitinger, Corinna; Gipp, Bela. News-Please: A Generic News Crawler and Extractor.
[26] Dallabetta, Max; Dobberstein, Conrad; Breiding, Adrian; Akbik, Alan. Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions.
[27] Fiil-Flynn, Sean M.; Butler, Brandon; Carroll, Michael; Cohen-Sasson, Or; Craig, Carys; Guibault, Lucie; Jaszi, Peter; J… Legal Reform to Enhance Global Text and Data Mining Research. Science, December.
[28] Barbaresi, Adrien. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction.
[29] Kargaran, Amir; Imani, Ayyoob; Yvon, Fran… GlotLID: Language Identification for Low-Resource Languages. Findings of the ACL: EMNLP 2023.
[30] Stahl, Peter M.
[31] Penedo, Guilherme; Kydl… The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. Proceedings of the NeurIPS.
[32] Bevendorff, Janek; Potthast, Martin; Stein, Benno. FastWARC: Optimizing Large-Scale Web Archive Analytics.
[33] Kohlsch… Proceedings of the WSDM. doi:10.1145/1718487.1718542
[34] Gangopadhyay, Susmita; Dess… TeleScope: A Longitudinal Dataset for Investigating Online Discourse and Information Interaction on Telegram. Proceedings of the ICWSM.
[35] Mayer, Anna-Theresa; Wedel, Lion; Batzner, Jan; Hendrickx, Jonathan; Bremer, Emma; Iwan, Alexander; Stocker, Volker; Ohme, Jakob. News on TikTok: An Annotated Dataset of TikTok Videos from German-Speaking News Outlets in 2023.
[36] Haouari, Fatima; Scarton, Carolina; Faggiani, Nicol… Proceedings of the ICWSM. doi:10.1609/icwsm.v19i1.35950
[37] Grusky, Max; Naaman, Mor; Artzi, Yoav. Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies.
[38] Mackenzie, Joel; Benham, Rodger; Petri, Matthias; Trippas, Johanne R.; Culpepper, J. Shane; Moffat, Alistair. CC-News-En: A Large English News Corpus.
[39] N… Proceedings of the ICWSM. doi:10.1609/icwsm.v13i01.3261
[40] Solovev, Kirill; Pr… Moralized Language Predicts Hate Speech on Social Media. PNAS Nexus.
[41] Wenzek, Guillaume; Lachaux, Marie-Anne; Conneau, Alexis; Chaudhary, Vishrav; Guzm… Proceedings of the LREC.
[42] Longpre, Shayne; Mahari, Robert; Lee, Ariel; Lund, Campbell; Oderinwale, Hamidah; Brannon, William; Saxena, Nayan; Obeng-Marnu, Naana; South, Tobin; Hunter, Cole; Klyman, Kevin; Klamm, Christopher; Schoelkopf, Hailey; Singh, Nikhil; Cherep, Manuel; Anis, Ahmad; Dinh, An; Chitongo, Caroline; Yin, Da; ... Consent in Crisis: The Rapid Decline of the AI Data Commons.
[43] Dodge, Jesse; Sap, Maarten; Marasovi… Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. Proceedings of the EMNLP.
[44] Angermaier, Mathias; Hoeldrich, Elisabeth; Lasser, Jana; … The Schwurbelarchiv: A German Language Telegram Dataset for the Study of Conspiracy Theories. arXiv:2504.06318
[45] Fletcher, Richard. doi:10.60625/risj-xm9g-ws87
[46] Nioche, Julien.
[47] Bak-Coleman, Joseph B.; Alfano, Mark; Barfuss, Wolfram; Bergstrom, Carl T.; Centeno, Miguel A.; Couzin, Iain D.; Donges, Jonathan F.; Galesic, Mirta; Gersick, Andrew S.; Jacquet, Jennifer; Kao, Albert B.; Moran, Rachel E.; Romanczuk, Pawel; Rubenstein, Daniel I.; Tombak, Kaia J.; Van Bavel, Jay J.; Weber, El...
[48] 2026.