Repeating high-quality filtered German web data over multiple epochs produces better language models than single-pass training on larger, more diverse but lower-quality sets, even after seven epochs.
We then fine-tuned three respective snowflake-arctic-embed-m-v2.0 (Yu et al.,
1 Pith paper cites this work (field: cs.CL; year: 2026; verdict: unverdicted):
Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling