Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3
The pith
Infini-News gives researchers fast searchable access to 1.35 billion processed Common Crawl news articles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. We extract, clean the text, and parse the structured metadata of over 1.35B articles. We enrich the corpus with language detection using three frontier language classifiers and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. We construct Infini-gram indexes that let researchers search the full archive for arbitrary text patterns in sub-second time.
What carries the argument
Infini-gram suffix-array indexes that deliver sub-second query times across the full 1.35 billion article corpus.
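To make the mechanism concrete, here is a minimal sketch of how a suffix array answers "how many times does this pattern occur?" with two binary searches instead of a full scan. It is a toy, character-level illustration with a naive construction step; the function names are hypothetical, and the real Infini-gram engine operates on tokenized text at a very different scale.

```python
def build_suffix_array(text: str) -> list:
    # Naive construction: sort all suffix start positions lexicographically.
    # Fine for a toy string, far too slow for a real corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text: str, sa: list, pattern: str, upper: bool) -> int:
    # Binary search over the sorted suffixes, comparing only the first
    # len(pattern) characters of each suffix against the pattern.
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text: str, sa: list, pattern: str) -> int:
    # Suffixes that start with the pattern form one contiguous run in the
    # suffix array, so the count is the width of that run.
    return _bound(text, sa, pattern, True) - _bound(text, sa, pattern, False)


corpus = "the news about the news cycle"
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, "news"))  # 2
print(count_occurrences(corpus, sa, "xyz"))   # 0
```

Each lookup costs a number of comparisons logarithmic in the corpus length, which is why query time can stay low as the corpus grows; whether it stays sub-second in practice depends on where the index lives (RAM or disk) and on the hardware.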
If this is right
- Longitudinal studies of news coverage over years become feasible without each team building its own pipeline.
- Cross-national media research can draw on country labels for most articles without additional attribution work.
- NLP experiments gain a standardized, language-tagged news dataset at full Common Crawl scale.
- Arbitrary phrase or pattern searches replace slow full-corpus scans for exploratory analysis.
Where Pith is reading between the lines
- The geographic tags could support new analyses of how coverage of the same event differs by country of origin.
- Similar indexing methods might be applied to other large web archives to create comparable research resources.
- Integration with existing NLP tools could let users run entity or sentiment queries directly against the indexed corpus.
Load-bearing premise
The extraction, cleaning, language detection, and geographic attribution steps produce data accurate enough for downstream research use.
What would settle it
A test that runs a set of known news-event queries on the live index, checks whether returned articles match independent ground-truth coverage and assigned countries, and confirms measured query latencies stay under one second.
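The ground-truth half of such a test could be scored along the following lines. This is a sketch under stated assumptions: articles are identified by URL, the ground truth is a hand-built mapping, and the function names and scoring rules (for example, counting an unresolved country as wrong) are illustrative choices, not anything specified by the paper.

```python
from typing import Dict, Optional, Set


def recall(expected_urls: Set[str], returned_urls: Set[str]) -> float:
    # Fraction of the known ground-truth articles that the index returned.
    if not expected_urls:
        return 0.0
    return len(expected_urls & returned_urls) / len(expected_urls)


def country_accuracy(assigned: Dict[str, Optional[str]],
                     truth: Dict[str, str]) -> float:
    # Score only URLs that have a ground-truth country and appear in the
    # index output; an unresolved country (None) counts as incorrect.
    scored = [url for url in truth if url in assigned]
    if not scored:
        return 0.0
    correct = sum(assigned[url] == truth[url] for url in scored)
    return correct / len(scored)


# Hypothetical example data.
print(recall({"a", "b"}, {"a", "c"}))  # 0.5
print(country_accuracy({"a": "DE", "b": None}, {"a": "DE", "b": "FR"}))  # 0.5
```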
Original abstract
Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
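The abstract does not say how the outputs of the three language classifiers are reconciled. One common option, shown here purely as an assumption, is a majority vote that leaves the language unresolved when no label reaches a minimum number of votes. The function name and the language codes are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_language(labels: List[str], min_votes: int = 2) -> Optional[str]:
    # Return the most frequent label if it has at least `min_votes` votes,
    # otherwise None to mark the article's language as unresolved.
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


# One label per classifier, in any order.
print(majority_language(["en", "en", "de"]))  # en
print(majority_language(["en", "de", "fr"]))  # None
```

Keeping all three raw labels next to the resolved one would let downstream users apply a stricter or looser rule than the one sketched here.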
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Infini-News, a retrieval toolkit and index for the full CC-News archive (August 2016 to latest snapshot) comprising over 1.35 billion articles. The authors detail extraction and cleaning of text and structured metadata, enrichment via language detection with GlotLID, lingua, and CommonLingua plus multi-source geographic attribution resolving a country for 83.4% of articles across 222 countries, and construction of Infini-gram suffix-array indexes claimed to support arbitrary-pattern searches in sub-second time.
Significance. If the data-processing pipeline produces sufficiently accurate output and the index performance claims are substantiated, the resource would meaningfully lower barriers to large-scale longitudinal and cross-national news analysis in computational social science and NLP. The scale of the processed corpus and the multi-classifier enrichment approach represent concrete contributions to open data infrastructure.
major comments (1)
- [Abstract] The claim that Infini-gram suffix-array indexes enable 'sub-second time' searches for arbitrary text patterns across the full 1.35 billion article corpus is presented without any reported query-latency measurements, index-construction cost, memory footprint, or hardware configuration. This is load-bearing for the central claim, because suffix-array performance at terabyte scale is sensitive to implementation choices (compressed vs. plain, disk vs. RAM, single-node vs. distributed) that remain unquantified.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential value of Infini-News for large-scale news analysis. We address the single major comment below and will incorporate the requested details in the revision.
- Referee: [Abstract] The claim that Infini-gram suffix-array indexes enable 'sub-second time' searches for arbitrary text patterns across the full 1.35 billion article corpus is presented without any reported query-latency measurements, index-construction cost, memory footprint, or hardware configuration. This is load-bearing for the central claim, because suffix-array performance at terabyte scale is sensitive to implementation choices (compressed vs. plain, disk vs. RAM, single-node vs. distributed) that remain unquantified.
Authors: We agree that the performance claim requires empirical support. The manuscript currently states the sub-second capability based on the theoretical efficiency of suffix arrays for exact pattern matching and on internal testing during index construction, but does not report quantitative benchmarks. In the revised manuscript we will add a new subsection (or appendix) that reports: (1) average and 95th-percentile query latencies for patterns of varying lengths on the full 1.35B-article index, (2) wall-clock time and compute cost for index construction, (3) peak memory footprint and storage size of the index, and (4) the exact hardware configuration (CPU, RAM, storage type, and whether the index resides in RAM or on disk). These measurements will be obtained by re-running the indexing and query pipeline on the production hardware and will be presented with clear methodology so readers can assess the claims.
Revision: yes
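A latency report of the kind promised in item (1) could be produced with a small harness like the one below. It is a sketch only: `search` stands in for whatever query function the toolkit exposes (an assumption, not a documented API), and the percentile uses the simple nearest-rank definition.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable, List


def measure_latencies(search: Callable[[str], object],
                      queries: Iterable[str]) -> List[float]:
    # Time each query with a monotonic high-resolution clock, in seconds.
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values: List[float], pct: float) -> float:
    # Nearest-rank percentile: the smallest value such that at least
    # pct percent of the observations are less than or equal to it.
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies: List[float]) -> Dict[str, float]:
    return {
        "mean_s": statistics.fmean(latencies),
        "p95_s": percentile(latencies, 95),
    }


# Placeholder search function; replace with a real index query.
stats = summarize(measure_latencies(lambda q: q.upper(), ["a", "b", "c"]))
print(sorted(stats))  # ['mean_s', 'p95_s']
```

Reporting results separately for cold and warm caches, and grouping queries by pattern length, would address the referee's point that disk-resident and RAM-resident indexes behave differently.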
Circularity Check
No circularity: paper is a systems/data contribution with no derivations or predictions
The paper describes extraction, cleaning, language detection, geographic attribution, and construction of Infini-gram suffix-array indexes for the CC-News corpus. No mathematical derivations, equations, predictions, fitted parameters, or first-principles results are present that could reduce to inputs by construction. Performance claims about sub-second queries are engineering assertions about the built indexes rather than derived quantities. The work is self-contained as a toolkit presentation without load-bearing self-citations, ansatzes, or uniqueness theorems.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CC-News snapshots from Common Crawl contain extractable news articles with usable text and metadata.