The State and Fate of Linguistic Diversity and Inclusion in the NLP World
19 Pith papers cite this work, alongside 289 external citations.
citation roles: background (4)
citing papers explorer
- Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
  TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
- Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
  Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
- Towards Measuring the Representation of Subjective Global Opinions in Language Models
  LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.
- Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP
  Catalogue records show 141 languages as data-poor, but citation mining reveals 609 datasets across 53 languages, exposing a visibility gap in multilingual NLP resources.
- Scaling Laws for Mixture Pretraining Under Data Constraints
  Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
- COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
  COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
- SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
  SLoW selects low-frequency word dictionaries to boost LLM translation quality and efficiency across 100 languages from FLORES.
- How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
  The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
- Lessons from the Trenches on Reproducible Evaluation of Language Models
  The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
- Ethical and social risks of harm from Language Models
  The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
- LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
  LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
- High-Volume Plaintiff-Side Counsel and Single-Appearance Eviction Cases in Philadelphia
  High-volume plaintiff-side counsel in Philadelphia eviction cases scales up filing volume and procedural steps but does not produce a broad premium on adverse tenant outcomes such as default or judgment.
- Which Are the Low-Resource Languages of the Semantic Web?
  A multi-level categorization from language distributions in DBpedia, BabelNet, and Wikidata defines low-resource languages for Semantic Web knowledge graphs.
- Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs
  Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.
- Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling
  Marco-MoE delivers open multilingual MoE models with 5% activation sparsity that outperform similarly sized dense models on English and multilingual benchmarks through efficient upcycling.
- How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language
  Bengali sentiment analysis models exhibit persistent identity-based biases across datasets and developer backgrounds despite similar semantic content.
- In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs
  A research plan to analyze language distribution in LOD knowledge graphs and explore cross-lingual transfer plus analogical reasoning to improve coverage for low-resource languages.
- Adam's Law: Textual Frequency Law on Large Language Models
  Frequent sentence-level text improves LLM prompting and fine-tuning performance across math, translation, commonsense, and tool-use tasks via a proposed frequency law and curriculum ordering.
- Multilingual Vision-Language Models, A Survey
  The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.