Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K

Tyler A · 2024 · cs.CL · arXiv 2408.10441

3 Pith papers cite this work.

abstract

For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g., 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1 GB or less of data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages. These models represent the first publicly available monolingual language models for 215 of the languages included.
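To make the bigram baseline mentioned in the abstract concrete, the following is a minimal sketch of how a bigram model's perplexity on held-out text could be computed. It is an illustration only, not the authors' evaluation code: the add-k smoothing, the function name `bigram_perplexity`, and the `vocab_size` parameter are all assumptions, and the abstract does not specify the tokenization or smoothing used.

```python
import math
from collections import Counter


def bigram_perplexity(train_tokens, test_tokens, vocab_size, k=1.0):
    """Perplexity of an add-k smoothed bigram model on test_tokens.

    Hypothetical helper for illustration; the smoothing scheme and the
    fixed vocab_size are assumptions, not details taken from the paper.
    """
    # Count how often each token appears as a context, and each bigram.
    context_counts = Counter(train_tokens[:-1])
    bigram_counts = Counter(zip(train_tokens[:-1], train_tokens[1:]))

    log_prob = 0.0
    n = 0
    for prev, cur in zip(test_tokens[:-1], test_tokens[1:]):
        # Add-k smoothing keeps unseen bigrams from getting zero probability.
        p = (bigram_counts[(prev, cur)] + k) / (context_counts[prev] + k * vocab_size)
        log_prob += math.log(p)
        n += 1

    if n == 0:
        raise ValueError("test_tokens must contain at least two tokens")

    # Perplexity is the exponentiated average negative log-likelihood.
    return math.exp(-log_prob / n)


if __name__ == "__main__":
    train = "a b a b a b".split()
    test = "a b a b".split()
    print(bigram_perplexity(train, test, vocab_size=2))
```

A lower value means the model assigns higher probability to the held-out text. With no training data the smoothed model is uniform, so its perplexity equals `vocab_size`; a model that scores above a baseline like this on a given language is doing worse than simple bigram statistics there.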

representative citing papers

Syntactic Belief Update as the Driver of Garden Path Processing Difficulty

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

Syntactic belief update via generalized Rényi divergence on syntactic trees predicts garden path reading times better than lexical surprisal.

On the Limits of Model Merging for Multilinguality in Pre-Training

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

Merging any combination of monolingual pre-trained models leads to performance collapse due to interference, indicating that merging flexibility from fine-tuning does not extend to pre-training.

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

cs.CL · 2025-06-02 · unverdicted · novelty 5.0

Inflectional features stay linearly decodable across all layers while lexical identity weakens with depth in modern transformers.

