Goldfish: Monolingual Language Models for 350 Languages

Benjamin K. Bergen; Catherine Arnett; Tyler A. Chang; Zhuowen Tu

arxiv: 2408.10441 · v3 · pith:Q654GWMPnew · submitted 2024-08-19 · 💻 cs.CL


Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen


keywords: models, languages, language, many, monolingual, multilingual, large, small


For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1GB or less data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages. These models represent the first publicly-available monolingual language models for 215 of the languages included.
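The bigram comparison mentioned in the abstract can be illustrated with a minimal sketch of how a bigram model's perplexity might be computed on held-out text. This is not the authors' implementation: the abstract does not specify the tokenization, the smoothing, or how perplexities are made comparable across models, so the add-alpha smoothing, the `<unk>` handling, and the function names `train_bigram` and `bigram_perplexity` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences, alpha=1.0):
    """Count bigrams over pre-tokenized sentences.

    Add-alpha smoothing and the <unk> token are illustrative assumptions,
    not details taken from the paper.
    """
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {"<unk>"}
    for tokens in sentences:
        seq = ["<s>"] + list(tokens) + ["</s>"]
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {"ctx": context_counts, "bi": bigram_counts,
            "vocab": vocab, "alpha": alpha}

def bigram_perplexity(model, sentences):
    """Per-token perplexity of held-out sentences under the bigram model."""
    vocab, alpha = model["vocab"], model["alpha"]
    v = len(vocab)
    log_prob, n = 0.0, 0
    for tokens in sentences:
        # Map tokens never seen in training to <unk>.
        mapped = [t if t in vocab else "<unk>" for t in tokens]
        seq = ["<s>"] + mapped + ["</s>"]
        for prev, cur in zip(seq, seq[1:]):
            p = (model["bi"][(prev, cur)] + alpha) / (model["ctx"][prev] + alpha * v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Toy usage with hypothetical data: seen word order scores lower
# perplexity than unseen word order.
toy_model = train_bigram([["a", "b"], ["a", "b"]])
seen_ppl = bigram_perplexity(toy_model, [["a", "b"]])
unseen_ppl = bigram_perplexity(toy_model, [["b", "a"]])
```

A language model "performs worse than bigrams" in this sense when its perplexity on the same evaluation text is higher than the value such a baseline produces, provided both are normalized over the same units.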


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models
cs.CL 2025-06 unverdicted novelty 5.0

Inflectional features stay linearly decodable across all layers while lexical identity weakens with depth in modern transformers.