HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Amanda Myntti; Andrey Kutuzov; Barry Haddow; Bhavitvya Malik; Dayyán O'Brien; Dušan Variš; Fedor Vitiugin; Gema Ramírez Sánchez; Jan Hajič; Janine Siewert

arxiv: 2511.01066 · v3 · submitted 2025-11-02 · 💻 cs.CL


Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O'Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza


Pith reviewed 2026-05-18 00:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual datasets · LLM pre-training · web crawls · language identification · parallel corpora · machine translation · pre-trained models · data filtering


The pith

Processed web crawls yield 30 trillion tokens of pre-training data across nearly 200 languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the ongoing creation of open, large-scale textual datasets for almost 200 languages by drawing from multiple web crawls. At a total of 30 trillion tokens, these resources come with a complete open-source pipeline that handles document selection, HTML text extraction, language identification in noisy content, deduplication, quality annotations, and final filtering. The work also supplies evaluation benchmarks for nine European languages, a set of trained monolingual models, and automatically mined parallel texts for machine translation. A sympathetic reader would care because access to this volume and breadth of data could support the development of more capable multilingual language technologies without depending on closed collections.

Core claim

The authors present HPLT 3.0 as an open collection of mono- and bilingual data at 30 trillion tokens derived from web archives, accompanied by an open pipeline for selection, extraction, deduplication, annotation with register and quality labels plus personally identifiable information detection, and final filtering. They assess data quality via contrastive statistics, manual sample review across 24 languages, and end-to-end training of encoder-decoder and GPT-like models, while also releasing comprehensive multilingual benchmarks and a synthesized parallel corpus.

What carries the argument

The automated pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with register labels and text quality estimates, and final selection and filtering.
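As an illustration of the deduplication stage only, here is a minimal sketch of exact plus near-duplicate removal. It is not the HPLT implementation: the function names, the word-trigram shingling, and the 0.8 Jaccard threshold are all assumptions chosen for readability, and the pairwise comparison would not scale to web-sized data, where an approximate scheme such as MinHash is typically used instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_key(text: str) -> str:
    """Content hash used for exact deduplication (hypothetical helper)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; texts shorter than n words yield one shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8):
    """Keep the first occurrence of each document.

    A document is dropped if its normalized hash was already seen (exact
    duplicate) or if its shingle set overlaps a kept document at or above
    the assumed threshold (near duplicate).
    """
    seen_hashes = set()
    kept = []  # list of (text, shingle set) pairs
    for doc in docs:
        key = exact_key(doc)
        if key in seen_hashes:
            continue
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue
        seen_hashes.add(key)
        kept.append((doc, doc_shingles))
    return [text for text, _ in kept]
```

Running `deduplicate` over one global document list, rather than once per crawl, is the kind of design choice that changes how many repeated segments survive.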

If this is right

Training data at this scale becomes openly available for languages that previously lacked sufficient resources.
The supplied benchmarks enable more consistent evaluation of multilingual models with reduced prompt sensitivity.
Mined parallel texts support direct improvements in machine translation systems for many language pairs.
The open pipeline allows other groups to reproduce or extend the data creation process on new crawls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The collection could reduce reliance on English-dominant data when building models that need to handle diverse languages equally.
The quality annotations might serve as a basis for selecting subsets tailored to specific downstream tasks such as instruction tuning.
Future work could test whether adding this data to existing mixes improves performance on low-resource language benchmarks more than scaling English data alone.

Load-bearing premise

The automated pipeline for processing noisy web data produces high-quality text suitable for LLM pre-training across nearly 200 languages.

What would settle it

If language models trained on subsets of this data perform substantially worse than equivalent models trained on smaller curated corpora when evaluated on the provided multilingual benchmarks, the data quality claim would be undermined.

Figures

Figures reproduced from arXiv: 2511.01066 by Amanda Myntti, Andrey Kutuzov, Barry Haddow, Bhavitvya Malik, Dayyán O'Brien, Dušan Variš, Fedor Vitiugin, Gema Ramírez Sánchez, Jan Hajič, Janine Siewert, Jaume Zaragoza, Jindřich Helcl, Jörg Tiedemann, Laurie Burchell, Lucas Charpentier, Lucie Poláková, Maja Buljan, Mariya Fedorova, Marta Bañón, Mikko Aulamo, Nikolay Arefev, Ona de Gibert, Pavel Stepachev, Pinzhen Chen, Risto Luukkonen, Sampo Pyysalo, Stephan Oepen, Tea Vojtěchová, Teemu Vahtola, Veronika Laippala, Vladislav Mikhailov, Zihao Li.

**Figure 1.** Schematic overview of data preparation.

**Figure 2.** Comparison of models pretrained on FineWeb, HPLT 2.0, 3.0, and MADLAD-400.

Original abstract

We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied with a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
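To make the abstract's "normalization and aggregation of scores" more concrete, the sketch below shows one plausible shape for it. The paper's task-selection description reports the maximum score across prompts as the main aggregation method; the mean-rank step that follows is an assumed stand-in for combining tasks whose metrics are on different scales, not the authors' exact procedure. All function and variable names are hypothetical.

```python
from statistics import mean


def best_prompt_score(scores_by_prompt: dict) -> float:
    """Collapse several prompt variants of one task into a single score
    by taking the maximum, which dampens sensitivity to prompt wording."""
    return max(scores_by_prompt.values())


def mean_rank(task_scores: dict) -> dict:
    """Average rank per model across tasks (1 = best).

    task_scores maps task name -> {model name: score}, higher is better.
    Ranking per task first means tasks with different metric scales can be
    combined without normalizing the raw scores. Ties are broken by
    insertion order, a simplification assumed for this sketch.
    """
    ranks = {}
    for by_model in task_scores.values():
        ordered = sorted(by_model, key=by_model.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: mean(positions) for model, positions in ranks.items()}
```

A typical use would first apply `best_prompt_score` to every (task, model) cell and then pass the resulting table to `mean_rank`; a lower mean rank indicates a model that is more consistently ahead across tasks.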

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a large-scale open data release at 30T tokens across nearly 200 languages with pipeline and some models, but quality evidence is thin outside the 24 languages they manually checked.


The main takeaway is that HPLT 3.0 ships a genuinely big multilingual corpus (30 trillion tokens, nearly 200 languages) plus an open processing pipeline, annotations for register and quality, mined parallel data, benchmarks for nine European languages, and a set of trained monolingual models. That combination of scale, coverage, and openness is the real contribution here. Releasing the full pipeline for web archive handling, HTML extraction, deduplication, and filtering is helpful for anyone who wants to reproduce or extend the work rather than just download the final dump. The end-to-end model evaluations and the contrastive quality probes give at least some signal that the data can be used for training. The parallel corpus synthesized via MT is a practical addition on top of the monolingual stuff. All of that earns credit as a usable resource drop.

The soft spot sits in the quality claims for the long tail. They did manual inspection and probes for 24 languages, which is better than many releases, but that leaves roughly 170 languages resting on automated language ID, deduplication, and filtering with no per-language token counts or error rates shown. For low-resource languages the usual problems with noisy web text and LID accuracy are well known, so the assumption that the pipeline produces uniformly suitable pretraining data across the board is the part that needs more visible support.

This paper is aimed at groups building multilingual LLMs or MT systems who need raw scale and are willing to do their own filtering or validation on top. A reader who wants an open starting point with some accompanying benchmarks and models will get concrete value from the release itself. It deserves peer review because the resource is large enough that the community will use it; referees can push for clearer breakdowns by resource tier and more explicit failure-mode discussion without changing the core offering.

Referee Report

2 major / 2 minor

Summary. The paper presents HPLT 3.0, an ongoing release of very large-scale multilingual textual datasets covering nearly 200 languages and totaling 30 trillion tokens, derived from web crawls via an open-source pipeline that includes document selection, HTML text extraction, language identification on noisy text, deduplication, annotation (register, quality, PII), and filtering. It reports quality probes via contrastive statistics and manual inspection of samples for 24 languages, end-to-end evaluations of trained models, a set of multilingual benchmarks for nine European languages emphasizing native tasks and prompt mitigation, 57 monolingual encoder-decoder models plus reference GPT-like models, and both mined and MT-synthesized parallel corpora.

Significance. If the quality and coverage claims hold, this constitutes a major resource contribution to multilingual LLM pre-training and MT research by providing one of the largest openly available collections at this scale, together with the full processing pipeline and richly annotated data. The open-source pipeline, provision of reproducible model training and evaluation setups, and focus on natively created evaluation tasks are explicit strengths that support broader adoption and further research on low-resource languages.

major comments (2)

[Abstract / Data Quality Probes] Abstract and the section describing data quality probes and manual inspection: the central claim that the end-to-end pipeline yields data suitable for LLM pre-training across nearly 200 languages rests on automated LID, deduplication, and filtering steps whose performance is validated through manual inspection and quality probes for only 24 languages plus aggregate model evaluations. Without per-language token counts by resource tier, error rates for the remaining ~170 languages, or explicit failure-mode analysis on noisy low-resource web text, the generalization of quality claims remains only moderately supported and is load-bearing for the utility of the full release.
[End-to-end Evaluation] The section on end-to-end model evaluations: while the paper states that various language model architectures were trained and evaluated on the data, the provided description lacks detailed quantitative results, per-language breakdowns, or error analysis that would directly substantiate the quality claims for the full language set; this weakens the evidential link between the pipeline and downstream suitability.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise table summarizing token counts or document volumes broken down by language-resource tier (high/medium/low) to give readers an immediate sense of coverage distribution.
[Multilingual Evaluation] Clarify the exact overlap or distinction between the 24 languages used for manual inspection and the nine languages covered by the new multilingual benchmarks.
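The tier table requested in the first minor comment could be derived from per-language token counts along the following lines. This is a sketch under stated assumptions: the tier names, the token thresholds, and the function names are hypothetical and are not taken from the paper, which does not define resource tiers in the material shown here.

```python
from collections import defaultdict

# Hypothetical tier floors in tokens, checked from highest to lowest.
# These boundaries are illustrative assumptions, not values from the paper.
TIERS = [("high", 10**11), ("medium", 10**9), ("low", 0)]


def tier_of(tokens: int) -> str:
    """Return the first tier whose floor the token count reaches."""
    for name, floor in TIERS:
        if tokens >= floor:
            return name
    raise ValueError("token count must be non-negative")


def tier_summary(token_counts: dict) -> dict:
    """Summarize {language: token count} into per-tier language counts,
    token totals, and each tier's share of all tokens."""
    summary = defaultdict(lambda: {"languages": 0, "tokens": 0})
    for tokens in token_counts.values():
        tier = tier_of(tokens)
        summary[tier]["languages"] += 1
        summary[tier]["tokens"] += tokens
    total = sum(token_counts.values())
    return {
        name: {
            "languages": summary[name]["languages"],
            "tokens": summary[name]["tokens"],
            "share": summary[name]["tokens"] / total if total else 0.0,
        }
        for name, _ in TIERS
    }
```

Reporting language counts next to token shares makes the coverage skew visible at a glance: a tier can hold most of the languages while contributing only a small fraction of the tokens.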

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of the HPLT 3.0 manuscript. We have addressed the major comments on data quality validation and end-to-end evaluations by clarifying the scope of our probes, adding requested details where feasible, and explaining the practical limitations of exhaustive per-language analysis at this scale. Revisions have been incorporated to strengthen the evidential basis without overstating claims.


Referee: [Abstract / Data Quality Probes] Abstract and the section describing data quality probes and manual inspection: the central claim that the end-to-end pipeline yields data suitable for LLM pre-training across nearly 200 languages rests on automated LID, deduplication, and filtering steps whose performance is validated through manual inspection and quality probes for only 24 languages plus aggregate model evaluations. Without per-language token counts by resource tier, error rates for the remaining ~170 languages, or explicit failure-mode analysis on noisy low-resource web text, the generalization of quality claims remains only moderately supported and is load-bearing for the utility of the full release.

Authors: We agree that stronger per-language evidence would be ideal. The 24 languages were selected as a stratified sample across high-, medium-, and low-resource tiers to probe pipeline behavior on representative noisy web text. In the revised manuscript we have added a table with per-language token counts broken down by resource tier for all languages with available metadata. We have also expanded the failure-mode analysis subsection to discuss common issues (e.g., boilerplate, code-switching, and LID errors) observed in low-resource crawls and the mitigation steps applied. However, exhaustive manual error-rate annotation for the remaining ~170 languages is not feasible within the scope of this release due to annotation cost and the absence of gold-standard references at this scale; we therefore rely on the combination of contrastive statistics, the sampled manual review, and downstream model performance as supporting evidence. revision: partial
Referee: [End-to-end Evaluation] The section on end-to-end model evaluations: while the paper states that various language model architectures were trained and evaluated on the data, the provided description lacks detailed quantitative results, per-language breakdowns, or error analysis that would directly substantiate the quality claims for the full language set; this weakens the evidential link between the pipeline and downstream suitability.

Authors: We have revised the end-to-end evaluation section to include additional quantitative results (perplexity and downstream task scores), per-language breakdowns for the nine languages covered by our new benchmark suite, and further error analysis of model outputs. These results are now presented with explicit links back to the data-quality probes. For the broader set of languages, detailed per-language breakdowns remain limited by the availability of native evaluation resources; we therefore report aggregated metrics while noting that the primary purpose of the model training was to validate overall pipeline utility rather than to produce state-of-the-art models for every language. revision: yes

Circularity Check

0 steps flagged

No circularity: data release with independent external validation


The paper is a resource release describing an automated pipeline for curating 30T tokens of multilingual data across ~200 languages, including web archive selection, HTML extraction, LID, deduplication, annotation, and filtering. Quality is assessed via contrastive statistics, manual inspection of samples for 24 languages, and end-to-end evaluations of trained monolingual models on separate benchmarks. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains are used to justify central claims. The work is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of the described web-to-text pipeline and the assumption that filtered web data yields high-quality training material for LLMs across many languages.

axioms (1)

domain assumption: Web crawls contain sufficient high-quality text for low-resource languages after automated filtering and annotation
Invoked in the description of the pipeline and final selection steps.

pith-pipeline@v0.9.0 · 5964 in / 1229 out tokens · 38453 ms · 2026-05-18T00:57:27.788744+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with... register labels, text quality estimates... and final selection and filtering

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: 30 trillion tokens... nearly 200 languages

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

[1]

HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Introduction. Massive text collections for pre-training are the “crude oil” of the LLM era. The process of “refining” high-quality datasets from web data at scale presupposes computational infrastructure and technological muscle that oftentimes is characteristic of corporate involvement, as evidenced for example by some notable generally available pre-trai...
[2]

wide crawls

Raw Web Archive Data. There are few available collections of massive web archives. Our work builds on the same set of so-called “wide crawls” from the Internet Archive (IA) as the HPLT 2.0 release, but combines this with a broader and much larger set of snapshots from the Common Crawl (CC). Specifically, we start from some 3.3 petabytes of IA crawls from ...
[3]

| |”) and the token share of each subset of the total (“%

Monolingual Data Preparation. Extraction of high-quality and richly annotated text from raw web archives proceeds through a sequence of refinement and filtering steps. Figure 1 shows the main components of our data preparation, which is an updated and extended version of the open-source HPLT pipeline by Burchell et al. (2025). The following paragraphs ...
[4]

Overall Dataset Statistics. To put the HPLT 3.0 monolingual dataset into perspective, Table 1 presents document and token counts for the English and multilingual (‘non-English’) partitions of the data, as well as counts for a small sample of individual languages. For ease of comparison, these statistics are accompanied with average document lengths and per-lan...
[5]

(2025), we calculate descriptive statistics for HPLT 3.0 using the HPLT Analytics tool and compare HPLT 3.0 to their dataset

In-Depth Analytics. Like Burchell et al. (2025), we calculate descriptive statistics for HPLT 3.0 using the HPLT Analytics tool and compare HPLT 3.0 to their dataset. Descriptive Statistics. We notice a substantial difference in unique segments, 73% in HPLT 3.0 vs. 52% in HPLT 2.0, on average. This likely reflects global rather than per-crawl deduplication (see a...
[6]

boilerplate

Manual Inspection. For 23 languages, native or fluent speakers have manually inspected randomly sampled documents and marked those that contain pornographic content, text with artifacts (e.g. navigational elements, headings or list items without proper delimitation, truncated text, or snippet markers), unnatural text (e.g. word lists for search engine op...
[7]

These languages are chosen to ensure both availability of native speakers in our development team and a minimum level of diversity in terms of language resources, families, and scripts

Multilingual LLM Evaluation. We develop HPLT-E, a framework for automated large-scale multilingual evaluation designed to systematically compare and refine data preparation choices across nine selected languages shown in Table 1. These languages are chosen to ensure both availability of native speakers in our development team and a minimum level of diversity i...
[8]

random” sampling represents the default approach, drawing uniformly on the full dataset, while “top

and (ii) prompt creation by native speakers. Task Selection. We use standard task-specific metrics and report the maximum score across the prompts as the main aggregation method. We extend the FineTasks evaluation design (Penedo et al., 2025) and select tasks that provide pretraining evaluation signal based on the following key criteria: monotonicity – the ...
[9]

Monolingual Encoder–Decoders. In this section, we apply the HPLT 3.0 dataset to train and evaluate 57 language-specific monolingual encoder–decoder language models, following the T5-base architecture (Raffel et al., 2020). This novel family of models, including intermediate checkpoints, is publicly available. Motivation & Approach. Despite the popularit...
[10]

to evaluate HPLT 3.0 quality as training data across a large number of languages; and
[11]

to provide a family of comparable monolingual encoder–decoders trained on current data. As regards the second purpose, we note the only available encoder–decoder with multilingual capabilities is mT5 pretrained on the mC4 corpus (Xue et al., 2021), and its instruction-tuned derivatives, e.g. Muennighoff et al. (2023b); Üstün et al. (2024). 8https://huggingface...
[12]

Mining for Bilingual Texts. After constructing and evaluating the monolingual corpus, a natural next step is to further leverage these resources to mine parallel data. Although the field is increasingly favoring LLMs over traditional encoder–decoder architectures, the use of parallel corpora in LLM pretraining has been shown to enhance multilingual capabilitie...
[13]

MT for Synthetic Data Generation. Many studies have shown the value of synthetic data, for example, (Doshi et al., 2024; Wang et al.,
[14]

smallish

and in this work we explore the use of machine translation as an effective short-cut to transfer knowledge from a resource-rich source language to a under-resourced target languages (de Gibert et al., 2025). Many open-source models are available and can easily be integrated in translation workflows (Tiedemann et al., 2023; Team et al., 2022). For scalabili...
[15]

All data, models, and software involved in this initiative are publicly available under permissive terms of use

Conclusions & Outlook. Our work substantially refines an existing pipeline for very large-scale preparation of mono- and bi-lingual datasets and applies it to a massive collection of web archives. All data, models, and software involved in this initiative are publicly available under permissive terms of use. The HPLT 3.0 dataset constitutes the largest multi...
[16]

Datasets for lower-resourced languages can be biased in very different ways (e.g

Ethics Statement. Datasets crawled from the Web can contain all sorts of offensive or harmful content, even though we are making our best to clean the data. Datasets for lower-resourced languages can be biased in very different ways (e.g. religious content is often over-represented in African languages datasets)
[17]

We do not claim that the datasets described above are completely free from defects

Limitations of this Work. Although we conducted a (limited) human evaluation of a selection of languages from the HPLT 3.0 datasets, the problem of robustly estimating the quality of web-crawled training data is far from being solved. We do not claim that the datasets described above are completely free from defects
[18]

This project has received funding from the Horizon Europe research and innovation programme of the European Union under Grant agreement No

Acknowledgments. We thank Étienne Simon (UiO) and Daryna Dementieva (TUM) for their contribution to our prompt collection for French and Ukrainian and Erik Henriksson, Erofili Psaltaki, and Otto Tarkka (all UTU) for their contributions to the manual inspection. This project has received funding from the Horizon Europe research and innovation programm...
[19]

Bibliographical References. Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thak...
[20]

Open llm leaderboard v2. Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, ...
[21]

[Shen et al.2024] Shen, Huangjun, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, and Jinsong Su

Intersecting register and genre: Understanding the contents of web-crawled corpora. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 386–397, Miami, USA. Association for Computational Linguistics. Dayyán O’Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, and Jörg Tiede- ...
[22]

Pouya Pezeshkpour and Estevam Hruschka

FineWeb2: One pipeline to scale them all – adapting pre-training data processing to every language. Pouya Pezeshkpour and Estevam Hruschka. 2024. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, Mexico City, Mexico. Association ...
[23]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

INCLUDE: Evaluating multilingual language understanding with regional knowledge. In The Thirteenth International Conference on Learning Representations. Mariana Romanyshyn, Oleksiy Syvokon, and Roman Kyslyi. 2024. The UNLP 2024 shared task on fine-tuning large language models for Ukrainian. In Proceedings of the Third Ukrainian Natural Language Processi...
[24]

Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, and Yang Liu

Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation. Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, and Yang Liu. 2024. Enhancing multilingual capabilities of large language models through self-distillation from resource-rich languages. In Proceedings of the 62nd Annual Meeting of the ...

[1] [1]

HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Introduction Massive text collections for pre-training are the “crude oil” of the LLM era. The process of “refin- ing” high-quality datasets from web data at scale presupposescomputationalinfrastructureandtech- nological muscle that oftentimes is characteristic of corporate involvement, as evidenced for example by some notable generally available pre-trai...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[2] [2]

wide crawls

Raw Web Archive Data There are few available collections of massive web archives. Our work builds on the same set of so- called “wide crawls” from the Internet Archive (IA) as the HPLT 2.0 release, but combines this with a broader and much larger set of snapshots from the Common Crawl (CC). Specifically, we start from some 3.3 petabytes of IA crawls from ...

work page 2012

[3] [3]

| |”) and the token share of each subset of the total (“%

Monolingual Data Preparation Extraction of high-quality and richly annotated text from raw web archives proceeds through a se- quence of refinement and filtering steps. Figure 1 shows the main components of our data prepara- tion, which is an updated and extended version of the open-source HPLT pipeline by Burchell et al. (2025). The following paragraphs ...

work page 2025

[4] [4]

Overall Dataset Statistics To put the HPLT 3.0 monolingual dataset into perspective, Table 1 presents document and to- ken counts5 for the English and multilingual (‘non- English’)partitionsofthedata, aswellascountsfor a small sample of individual languages. For ease of comparison, these statistics are accompanied with average document lengths and per-lan...

work page 2025

[5] [5]

(2025), we calculate descriptive statisticsforHPLT3.0usingtheHPLTAnalyticstool 6 and compare HPLT 3.0 to their dataset

In-Depth Analytics Like Burchell et al. (2025), we calculate descriptive statisticsforHPLT3.0usingtheHPLTAnalyticstool 6 and compare HPLT 3.0 to their dataset. Descriptive StatisticsWe notice a substantial difference in unique segments, 73% in HPLT 3.0 vs. 52% in HPLT 2.0, on average. This likely re- flects global rather than per-crawl deduplication (seea...

work page 2025

[6] [6]

boilerplate

Manual Inspection For 23 languages, native or fluent speakers have manually inspected randomly sampled documents and marked those that contain pornographic con- tent, text with artifacts (e.g. navigational elements, headings or list items without proper delimitation, truncated text, or snippet markers), unnatural text (e.g. word lists for search engine op...

Multilingual LLM Evaluation We develop HPLT-E, a framework for automated large-scale multilingual evaluation designed to systematically compare and refine data preparation choices across nine selected languages shown in Table 1. These languages are chosen to ensure both availability of native speakers in our development team and a minimum level of diversity in terms of language resources, families, and scripts ...


“random” sampling represents the default approach, drawing uniformly on the full dataset, while “top ...

and (ii) prompt creation by native speakers. Task Selection We use standard task-specific metrics and report the maximum score across the prompts as the main aggregation method. We extend the FineTasks evaluation design (Penedo et al., 2025) and select tasks that provide pretraining evaluation signal based on the following key criteria: monotonicity – the ...
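The aggregation rule stated above, reporting the maximum score across prompts, can be sketched as follows. This is only an illustration under assumptions: the function name `aggregate_max`, the prompt identifiers, and the scores are invented for the example and do not come from HPLT-E.

```python
from typing import Dict


def aggregate_max(prompt_scores: Dict[str, float]) -> float:
    """Return the best score obtained over all prompt variants of one task."""
    if not prompt_scores:
        raise ValueError("at least one prompt score is required")
    return max(prompt_scores.values())


# Hypothetical scores of one model on one task under three prompt variants.
scores = {"prompt_a": 0.41, "prompt_b": 0.47, "prompt_c": 0.39}
best = aggregate_max(scores)
print(best)
```

Taking the maximum rather than the mean reports the model's score under its most favourable prompt, so a single poorly matched prompt formulation does not pull the reported result down.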


Monolingual Encoder–Decoders In this section, we apply the HPLT 3.0 dataset to train and evaluate 57 language-specific monolingual encoder–decoder language models, following the T5-base architecture (Raffel et al., 2020). This novel family of models, including intermediate checkpoints, is publicly available.8 Motivation & Approach Despite the popularit...


to evaluate HPLT 3.0 quality as training data across a large number of languages; and


to provide a family of comparable monolingual encoder–decoders trained on current data. As regards the second purpose, we note that the only available encoder–decoder with multilingual capabilities is mT5, pretrained on the mC4 corpus (Xue et al., 2021), and its instruction-tuned derivatives, e.g. Muennighoff et al. (2023b); Üstün et al. (2024). 8https://huggingface...


Mining for Bilingual Texts After constructing and evaluating the monolingual corpus, a natural next step is to further leverage these resources to mine parallel data. Although the field is increasingly favoring LLMs over traditional encoder–decoder architectures, the use of parallel corpora in LLM pretraining has been shown to enhance multilingual capabilitie...


MT for Synthetic Data Generation Many studies have shown the value of synthetic data, for example, (Doshi et al., 2024; Wang et al.,


and in this work we explore the use of machine translation as an effective short-cut to transfer knowledge from a resource-rich source language to under-resourced target languages (de Gibert et al., 2025). Many open-source models are available and can easily be integrated in translation workflows (Tiedemann et al., 2023; Team et al., 2022). For scalabili...


Conclusions & Outlook Our work substantially refines an existing pipeline for very large-scale preparation of mono- and bi-lingual datasets and applies it to a massive collection of web archives. All data, models, and software involved in this initiative are publicly available under permissive terms of use. The HPLT 3.0 dataset constitutes the largest multi...


Ethics Statement Datasets crawled from the Web can contain all sorts of offensive or harmful content, even though we are doing our best to clean the data. Datasets for lower-resourced languages can be biased in very different ways (e.g. religious content is often over-represented in African languages datasets).


Limitations of this Work Although we conducted a (limited) human evaluation of a selection of languages from the HPLT 3.0 datasets, the problem of robustly estimating the quality of web-crawled training data is far from being solved. We do not claim that the datasets described above are completely free from defects

Acknowledgments We thank Étienne Simon (UiO) and Daryna Dementieva (TUM) for their contribution to our prompt collection for French and Ukrainian and Erik Henriksson, Erofili Psaltaki, and Otto Tarkka (all UTU) for their contributions to the manual inspection. This project has received funding from the Horizon Europe research and innovation programme of the European Union under Grant agreement No ...


Bibliographical References Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thak...


Open llm leaderboard v2. Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, ...


[Shen et al.2024] Shen, Huangjun, Liangying Shao, Wenbo Li, Zhibin Lan, Zhanyu Liu, and Jinsong Su

Intersecting register and genre: Understanding the contents of web-crawled corpora. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 386–397, Miami, USA. Association for Computational Linguistics. Dayyán O’Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, and Jörg Tiede- ...


FineWeb2: One pipeline to scale them all – adapting pre-training data processing to every language. Pouya Pezeshkpour and Estevam Hruschka. 2024. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, Mexico City, Mexico. Association ...


Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

INCLUDE: Evaluating multilingual language understanding with regional knowledge. In The Thirteenth International Conference on Learning Representations. Mariana Romanyshyn, Oleksiy Syvokon, and Roman Kyslyi. 2024. The UNLP 2024 shared task on fine-tuning large language models for Ukrainian. In Proceedings of the Third Ukrainian Natural Language Processi...


Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation. Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, and Yang Liu. 2024. Enhancing multilingual capabilities of large language models through self-distillation from resource-rich languages. In Proceedings of the 62nd Annual Meeting of the ...
