The State and Fate of Linguistic Diversity and Inclusion in the NLP World

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit Choudhury · 2020 · Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics · DOI 10.18653/v1/2020.acl-main.560 · arXiv 2004.09095

19 Pith papers cite this work, alongside 289 external citations (Crossref). Polarity classification is still being indexed.

citation-role summary

background 4

citation-polarity summary

background 3 · unclear 1

representative citing papers

Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.

Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.

Towards Measuring the Representation of Subjective Global Opinions in Language Models

cs.CL · 2023-06-28 · conditional · novelty 7.0

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Catalogue records show 141 languages as data-poor, but citation mining reveals 609 datasets across 53 languages, exposing a visibility gap in multilingual NLP resources.

Scaling Laws for Mixture Pretraining Under Data Constraints

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.

COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

cs.CL · 2025-07-25 · conditional · novelty 6.0

SLoW selects low-frequency word dictionaries to boost LLM translation quality and efficiency across 100 languages from FLORES.

How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

cs.CL · 2024-11-08 · unverdicted · novelty 6.0

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.

Lessons from the Trenches on Reproducible Evaluation of Language Models

cs.CL · 2024-05-23 · accept · novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

High-Volume Plaintiff-Side Counsel and Single-Appearance Eviction Cases in Philadelphia

stat.AP · 2026-05-20 · unverdicted · novelty 5.0

High-volume plaintiff-side counsel in Philadelphia eviction cases scales up filing volume and procedural steps but does not produce a broad premium on adverse tenant outcomes such as default or judgment.

Which Are the Low-Resource Languages of the Semantic Web?

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

A multi-level categorization from language distributions in DBpedia, BabelNet, and Wikidata defines low-resource languages for Semantic Web knowledge graphs.

Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs

cs.CL · 2026-05-02 · unverdicted · novelty 5.0

Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.

Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

Marco-MoE delivers open multilingual MoE models with 5% activation sparsity that outperform similarly sized dense models on English and multilingual benchmarks through efficient upcycling.

How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language

cs.CL · 2025-06-07 · conditional · novelty 5.0

Bengali sentiment analysis models exhibit persistent identity-based biases across datasets and developer backgrounds despite similar semantic content.

In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs

cs.AI · 2026-05-07 · unverdicted · novelty 4.0

A research plan to analyze language distribution in LOD knowledge graphs and explore cross-lingual transfer plus analogical reasoning to improve coverage for low-resource languages.

Adam's Law: Textual Frequency Law on Large Language Models

cs.CL · 2026-04-02 · unverdicted · novelty 3.0

Frequent sentence-level text improves LLM prompting and fine-tuning performance across math, translation, commonsense, and tool-use tasks via a proposed frequency law and curriculum ordering.

Multilingual Vision-Language Models, A Survey

cs.CL · 2025-09-26 · accept · novelty 3.0

The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

citing papers explorer

Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion

cs.CL · 2026-04-20 · unverdicted · none · ref 72

TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.

Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

cs.CL · 2026-04-12 · unverdicted · none · ref 15

Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.

Towards Measuring the Representation of Subjective Global Opinions in Language Models

cs.CL · 2023-06-28 · conditional · none · ref 48

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

cs.CL · 2026-05-17 · unverdicted · none · ref 13

Catalogue records show 141 languages as data-poor, but citation mining reveals 609 datasets across 53 languages, exposing a visibility gap in multilingual NLP resources.

Scaling Laws for Mixture Pretraining Under Data Constraints

cs.LG · 2026-05-12 · unverdicted · none · ref 19

Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.

COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

cs.LG · 2026-04-22 · unverdicted · none · ref 148

COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

cs.CL · 2025-07-25 · conditional · none · ref 12

SLoW selects low-frequency word dictionaries to boost LLM translation quality and efficiency across 100 languages from FLORES.

How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

cs.CL · 2024-11-08 · unverdicted · none · ref 26

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.

Lessons from the Trenches on Reproducible Evaluation of Language Models

cs.CL · 2024-05-23 · accept · none · ref 19

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · none · ref 136

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

cs.CL · 2026-05-21 · unverdicted · none · ref 9

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

High-Volume Plaintiff-Side Counsel and Single-Appearance Eviction Cases in Philadelphia

stat.AP · 2026-05-20 · unverdicted · none · ref 26

High-volume plaintiff-side counsel in Philadelphia eviction cases scales up filing volume and procedural steps but does not produce a broad premium on adverse tenant outcomes such as default or judgment.

Which Are the Low-Resource Languages of the Semantic Web?

cs.AI · 2026-05-07 · unverdicted · none · ref 2

A multi-level categorization from language distributions in DBpedia, BabelNet, and Wikidata defines low-resource languages for Semantic Web knowledge graphs.

Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs

cs.CL · 2026-05-02 · unverdicted · none · ref 64

Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.

Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

cs.CL · 2026-04-28 · unverdicted · none · ref 4

Marco-MoE delivers open multilingual MoE models with 5% activation sparsity that outperform similarly sized dense models on English and multilingual benchmarks through efficient upcycling.

How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language

cs.CL · 2025-06-07 · conditional · none · ref 79

Bengali sentiment analysis models exhibit persistent identity-based biases across datasets and developer backgrounds despite similar semantic content.

In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs

cs.AI · 2026-05-07 · unverdicted · none · ref 19

A research plan to analyze language distribution in LOD knowledge graphs and explore cross-lingual transfer plus analogical reasoning to improve coverage for low-resource languages.

Adam's Law: Textual Frequency Law on Large Language Models

cs.CL · 2026-04-02 · unverdicted · none · ref 21

Frequent sentence-level text improves LLM prompting and fine-tuning performance across math, translation, commonsense, and tool-use tasks via a proposed frequency law and curriculum ordering.

Multilingual Vision-Language Models, A Survey

cs.CL · 2025-09-26 · accept · none · ref 72

The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
