arXiv preprint arXiv:2506.20920 , year=

FineWeb2: One Pipeline to Scale Them All--Adapting Pre-Training Data Processing to Every Language , author= · 2025 · arXiv 2506.20920

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

cs.CL · 2026-05-18 · accept · novelty 6.0

SomaliWeb v1 delivers a cleaned Somali corpus, efficient BPE tokenizer, and side-by-side language identification benchmark while documenting defects in prior multilingual datasets.

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

cs.LG · 2026-05-13 · conditional · novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.

Scaling Laws for Mixture Pretraining Under Data Constraints

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.

Granite Embedding Multilingual R2 Models

cs.IR · 2026-05-13 · unverdicted · novelty 4.0

Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

cs.CL · 2025-08-08 · unverdicted · novelty 4.0

GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

citing papers explorer

Showing 6 of 6 citing papers.

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE cs.CL · 2026-05-18 · unverdicted · none · ref 24
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark cs.CL · 2026-05-18 · accept · none · ref 22
SomaliWeb v1 delivers a cleaned Somali corpus, efficient BPE tokenizer, and side-by-side language identification benchmark while documenting defects in prior multilingual datasets.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings cs.LG · 2026-05-13 · conditional · none · ref 18
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Scaling Laws for Mixture Pretraining Under Data Constraints cs.LG · 2026-05-12 · unverdicted · none · ref 34 · 2 links
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
Granite Embedding Multilingual R2 Models cs.IR · 2026-05-13 · unverdicted · none · ref 14
Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models cs.CL · 2025-08-08 · unverdicted · none · ref 27
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

arXiv preprint arXiv:2506.20920 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer