HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Pith reviewed 2026-05-18 00:57 UTC · model grok-4.3
The pith
Processed web crawls yield 30 trillion tokens of pre-training data across nearly 200 languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present HPLT 3.0 as an open collection of mono- and bilingual data derived from web archives, with 30 trillion tokens of pre-training text. It is accompanied by an open pipeline for selection, extraction, deduplication, annotation (register labels, quality estimates, and detection of personally identifiable information), and final filtering. They assess data quality via contrastive statistics, manual review of samples for 24 languages, and end-to-end training of encoder-decoder and GPT-like models. They also release a benchmark collection for nine European languages, a mined parallel corpus, and a parallel corpus synthesized via machine translation.
What carries the argument
The automated pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with register labels and text quality estimates, and final selection and filtering.
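To make the deduplication stage concrete, here is a minimal Python sketch of exact and near-duplicate removal. It is an illustration only, not the HPLT pipeline's code: the case and whitespace normalization, the SHA-1 key, the 3-word shingles, the Jaccard threshold of 0.8, and all function names are assumptions chosen for the example.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_key(text: str) -> str:
    """Hash of the normalized text, used to detect exact duplicates."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping n-word tuples from the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8):
    """Keep the first occurrence of each document; drop exact copies
    and documents whose shingle overlap with a kept one is >= threshold."""
    seen_hashes = set()
    kept = []  # list of (document, shingle set) pairs
    for doc in docs:
        key = exact_key(doc)
        if key in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(key)
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue  # near-duplicate of an already kept document
        kept.append((doc, doc_shingles))
    return [doc for doc, _ in kept]
```

The pairwise comparison here is quadratic in the number of documents, so a web-scale system would need an approximate method such as MinHash with locality-sensitive hashing instead.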
If this is right
- Training data at this scale becomes openly available for languages that previously lacked sufficient resources.
- The supplied benchmarks enable more consistent evaluation of multilingual models with reduced prompt sensitivity.
- Mined parallel texts support direct improvements in machine translation systems for many language pairs.
- The open pipeline allows other groups to reproduce or extend the data creation process on new crawls.
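The prompt-sensitivity point can be made concrete with a small sketch. The paper's evaluation section reports the maximum score across prompts as its main aggregation method; the code below assumes that rule. The task names, prompt identifiers, scores, and the extra "spread" column (maximum minus minimum, a rough sensitivity indicator) are hypothetical and not taken from the paper.

```python
def aggregate_over_prompts(scores):
    """Collapse per-prompt scores into one summary per task.

    scores maps task name -> {prompt id -> score}.
    Returns task name -> {"max": best score, "spread": max minus min}.
    """
    summary = {}
    for task, by_prompt in scores.items():
        values = list(by_prompt.values())
        summary[task] = {
            "max": max(values),
            "spread": max(values) - min(values),
        }
    return summary


# Made-up scores for two hypothetical tasks and three prompt variants.
example_scores = {
    "task_a": {"prompt_1": 0.5, "prompt_2": 0.75, "prompt_3": 0.25},
    "task_b": {"prompt_1": 0.5, "prompt_2": 0.5, "prompt_3": 0.5},
}
example_summary = aggregate_over_prompts(example_scores)
```

A large spread for a task signals that a single-prompt score would be an unreliable basis for comparing data preparation choices.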
Where Pith is reading between the lines
- The collection could reduce reliance on English-dominant data when building models that need to handle diverse languages equally.
- The quality annotations might serve as a basis for selecting subsets tailored to specific downstream tasks such as instruction tuning.
- Future work could test whether adding this data to existing mixes improves performance on low-resource language benchmarks more than scaling English data alone.
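The idea of selecting task-specific subsets from the annotations can be sketched as a simple filter. This is an assumption-laden illustration: the field names `lang`, `quality`, and `register`, the score range of 0 to 1, the threshold, and the register label values are all hypothetical and do not reflect the actual schema of the release.

```python
def select_subset(docs, min_quality=0.7, registers=None):
    """Return documents whose quality score meets the threshold and,
    if a register set is given, whose register label is in that set."""
    chosen = []
    for doc in docs:
        if doc["quality"] < min_quality:
            continue  # below the assumed quality cut-off
        if registers is not None and doc["register"] not in registers:
            continue  # register not wanted for this subset
        chosen.append(doc)
    return chosen


# Made-up records with hypothetical field names and label values.
sample_docs = [
    {"id": 1, "lang": "xx", "quality": 0.9, "register": "instructional"},
    {"id": 2, "lang": "xx", "quality": 0.4, "register": "instructional"},
    {"id": 3, "lang": "xx", "quality": 0.8, "register": "narrative"},
]
instruction_like = select_subset(sample_docs, registers={"instructional"})
```

Keeping the threshold and register set as parameters makes it easy to compare several candidate subsets on the same downstream task.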
Load-bearing premise
The automated pipeline for processing noisy web data produces high-quality text suitable for LLM pre-training across nearly 200 languages.
What would settle it
If language models trained on subsets of this data perform substantially worse than equivalent models trained on smaller curated corpora when evaluated on the provided multilingual benchmarks, the data quality claim would be undermined.
Original abstract
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied with a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HPLT 3.0, an ongoing release of very large-scale multilingual textual datasets covering nearly 200 languages and totaling 30 trillion tokens, derived from web crawls via an open-source pipeline that includes document selection, HTML text extraction, language identification on noisy text, deduplication, annotation (register, quality, PII), and filtering. It reports quality probes via contrastive statistics and manual inspection of samples for 24 languages, end-to-end evaluations of trained models, a set of multilingual benchmarks for nine European languages emphasizing native tasks and prompt mitigation, 57 monolingual encoder-decoder models plus reference GPT-like models, and both mined and MT-synthesized parallel corpora.
Significance. If the quality and coverage claims hold, this constitutes a major resource contribution to multilingual LLM pre-training and MT research by providing one of the largest openly available collections at this scale, together with the full processing pipeline and richly annotated data. The open-source pipeline, provision of reproducible model training and evaluation setups, and focus on natively created evaluation tasks are explicit strengths that support broader adoption and further research on low-resource languages.
major comments (2)
- [Abstract / Data Quality Probes] Abstract and the section describing data quality probes and manual inspection: the central claim that the end-to-end pipeline yields data suitable for LLM pre-training across nearly 200 languages rests on automated LID, deduplication, and filtering steps whose performance is validated through manual inspection and quality probes for only 24 languages plus aggregate model evaluations. Without per-language token counts by resource tier, error rates for the remaining ~170 languages, or explicit failure-mode analysis on noisy low-resource web text, the generalization of quality claims remains only moderately supported and is load-bearing for the utility of the full release.
- [End-to-end Evaluation] The section on end-to-end model evaluations: while the paper states that various language model architectures were trained and evaluated on the data, the provided description lacks detailed quantitative results, per-language breakdowns, or error analysis that would directly substantiate the quality claims for the full language set; this weakens the evidential link between the pipeline and downstream suitability.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise table summarizing token counts or document volumes broken down by language-resource tier (high/medium/low) to give readers an immediate sense of coverage distribution.
- [Multilingual Evaluation] Clarify the exact overlap or distinction between the 24 languages used for manual inspection and the nine languages covered by the new multilingual benchmarks.
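The table requested in the first minor comment could be produced along the following lines. This is a sketch under stated assumptions: the tier thresholds (100 billion and 1 billion tokens), the placeholder language codes, and the token counts are invented for illustration and are not figures from the paper.

```python
def tier_of(tokens, high=100_000_000_000, medium=1_000_000_000):
    """Assign a resource tier from a token count (assumed thresholds)."""
    if tokens >= high:
        return "high"
    if tokens >= medium:
        return "medium"
    return "low"


def summarize_by_tier(token_counts):
    """Aggregate language count, token count, and token share per tier."""
    total = sum(token_counts.values())
    rows = {
        tier: {"languages": 0, "tokens": 0, "share": 0.0}
        for tier in ("high", "medium", "low")
    }
    for tokens in token_counts.values():
        row = rows[tier_of(tokens)]
        row["languages"] += 1
        row["tokens"] += tokens
    for row in rows.values():
        row["share"] = row["tokens"] / total if total else 0.0
    return rows


def format_table(rows):
    """Render the summary as a fixed-width plain-text table."""
    lines = [f"{'tier':<8}{'languages':>10}{'tokens':>18}{'share':>8}"]
    for tier, row in rows.items():
        lines.append(
            f"{tier:<8}{row['languages']:>10}"
            f"{row['tokens']:>18,}{row['share']:>8.1%}"
        )
    return "\n".join(lines)


# Invented counts for five placeholder languages.
example_counts = {
    "lang_a": 300_000_000_000,
    "lang_b": 60_000_000_000,
    "lang_c": 39_000_000_000,
    "lang_d": 500_000_000,
    "lang_e": 500_000_000,
}
example_rows = summarize_by_tier(example_counts)
```

Calling `print(format_table(example_rows))` renders one row per tier, which is the kind of at-a-glance coverage summary the comment asks for.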
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of the HPLT 3.0 manuscript. We have addressed the major comments on data quality validation and end-to-end evaluations by clarifying the scope of our probes, adding requested details where feasible, and explaining the practical limitations of exhaustive per-language analysis at this scale. Revisions have been incorporated to strengthen the evidential basis without overstating claims.
Point-by-point responses
- Referee: [Abstract / Data Quality Probes] Abstract and the section describing data quality probes and manual inspection: the central claim that the end-to-end pipeline yields data suitable for LLM pre-training across nearly 200 languages rests on automated LID, deduplication, and filtering steps whose performance is validated through manual inspection and quality probes for only 24 languages plus aggregate model evaluations. Without per-language token counts by resource tier, error rates for the remaining ~170 languages, or explicit failure-mode analysis on noisy low-resource web text, the generalization of quality claims remains only moderately supported and is load-bearing for the utility of the full release.
  Authors: We agree that stronger per-language evidence would be ideal. The 24 languages were selected as a stratified sample across high-, medium-, and low-resource tiers to probe pipeline behavior on representative noisy web text. In the revised manuscript we have added a table with per-language token counts broken down by resource tier for all languages with available metadata. We have also expanded the failure-mode analysis subsection to discuss common issues (e.g., boilerplate, code-switching, and LID errors) observed in low-resource crawls and the mitigation steps applied. However, exhaustive manual error-rate annotation for the remaining ~170 languages is not feasible within the scope of this release due to annotation cost and the absence of gold-standard references at this scale; we therefore rely on the combination of contrastive statistics, the sampled manual review, and downstream model performance as supporting evidence.
  Revision: partial
- Referee: [End-to-end Evaluation] The section on end-to-end model evaluations: while the paper states that various language model architectures were trained and evaluated on the data, the provided description lacks detailed quantitative results, per-language breakdowns, or error analysis that would directly substantiate the quality claims for the full language set; this weakens the evidential link between the pipeline and downstream suitability.
  Authors: We have revised the end-to-end evaluation section to include additional quantitative results (perplexity and downstream task scores), per-language breakdowns for the nine languages covered by our new benchmark suite, and further error analysis of model outputs. These results are now presented with explicit links back to the data-quality probes. For the broader set of languages, detailed per-language breakdowns remain limited by the availability of native evaluation resources; we therefore report aggregated metrics while noting that the primary purpose of the model training was to validate overall pipeline utility rather than to produce state-of-the-art models for every language.
  Revision: yes
Circularity Check
No circularity: data release with independent external validation
The paper is a resource release describing an automated pipeline for curating 30T tokens of multilingual data across ~200 languages, including web archive selection, HTML extraction, LID, deduplication, annotation, and filtering. Quality is assessed via contrastive statistics, manual inspection of samples for 24 languages, and end-to-end evaluations of trained monolingual models on separate benchmarks. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains are used to justify central claims. The work is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Web crawls contain sufficient high-quality text for low-resource languages after automated filtering and annotation.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with... register labels, text quality estimates... and final selection and filtering"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "30 trillion tokens... nearly 200 languages"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.