No Language Left Behind: Scaling Human-Centered Machine Translation
Pith reviewed 2026-05-12 17:48 UTC · model grok-4.3
The pith
A sparsely gated mixture of experts model trained on mined low-resource data achieves a 44% relative BLEU improvement in translating 200 languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A conditional compute model based on a sparsely gated mixture of experts, trained with new data mining techniques for low-resource languages and with added safeguards against overfitting, raises BLEU scores by 44% relative to the prior state of the art while passing human quality and toxicity evaluations across more than 40,000 translation directions.
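The headline figure is a relative, not absolute, BLEU gain. A quick sanity check of what that means; the BLEU values below are hypothetical placeholders, not numbers from the paper:

```python
# Relative BLEU improvement as in the abstract: (new - old) / old, in percent.
# All concrete scores here are illustrative, not the paper's.

def relative_improvement(new_bleu: float, old_bleu: float) -> float:
    """Return the relative gain of new_bleu over old_bleu as a percentage."""
    if old_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return 100.0 * (new_bleu - old_bleu) / old_bleu

# A 44% relative gain corresponds to e.g. a baseline of 25.0 BLEU rising to 36.0.
print(relative_improvement(36.0, 25.0))  # 44.0
```

The same relative figure can therefore correspond to very different absolute gains depending on the baseline, which is why the per-direction breakdown the referee asks for matters.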
What carries the argument
The sparsely gated mixture of experts architecture, which routes each input to a small subset of experts, combined with tailored data mining that targets low-resource languages.
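A minimal sketch of what "routes each input to a small subset of experts" means, assuming the standard top-k gating formulation; the sizes and random gating weights are illustrative, not the paper's configuration:

```python
import math
import random

# Top-2 sparsely gated routing: score all experts, keep the best two,
# and renormalize their weights. Sizes and weights are random stand-ins.

random.seed(0)
NUM_EXPERTS, D_MODEL = 8, 16
# One learned score column per expert (here: random placeholders).
W_GATE = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(D_MODEL)]

def route(x, k=2):
    """Return (expert indices, softmax weights) of the k highest-scoring experts."""
    logits = [sum(x[d] * W_GATE[d][e] for d in range(D_MODEL))
              for e in range(NUM_EXPERTS)]
    top_k = sorted(range(NUM_EXPERTS), key=lambda e: logits[e])[-k:]
    exps = [math.exp(logits[e]) for e in top_k]
    total = sum(exps)
    return top_k, [w / total for w in exps]  # softmax over the selected experts only

token = [random.gauss(0, 1) for _ in range(D_MODEL)]
experts, weights = route(token)
assert len(experts) == 2 and abs(sum(weights) - 1.0) < 1e-9
```

Because only k of the experts run per token, parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the "conditional compute" property the core claim rests on.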
If this is right
- Thousands of previously unsupported translation directions become accurate enough for everyday use.
- A combined human-quality and toxicity benchmark becomes a standard way to judge multilingual systems.
- Releasing the model and mined data lets others add still more languages without starting from scratch.
Where Pith is reading between the lines
- Conditional routing of computation could be applied to scale speech recognition or summarization to the same wide language set.
- Gathering direct input from native speakers before model design offers a repeatable way to keep other multilingual tools grounded in actual user needs.
- Further increases in model size and data coverage could test whether the same techniques eventually support reliable translation even for languages with almost no written data.
Load-bearing premise
The new data mining and model changes produce genuinely better and safer translations instead of merely matching the new benchmark or the particular human raters used.
What would settle it
An independent human evaluation on a fresh set of low-resource sentence pairs that finds no relative BLEU gain or that finds higher rates of toxic outputs.
Original abstract
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the No Language Left Behind (NLLB) project, which develops machine translation systems supporting 200 languages with emphasis on low-resource directions. It begins with interviews of native speakers, introduces novel data-mining techniques for low-resource data, trains a Sparsely Gated Mixture-of-Experts model augmented with architectural and training changes to mitigate overfitting across thousands of tasks, and evaluates performance on the human-translated Flores-200 benchmark. The central empirical claim is a 44% relative BLEU improvement over prior state-of-the-art, supported by large-scale human evaluation and a new toxicity benchmark covering all 200 languages; all models, code, and data-mining pipelines are open-sourced.
Significance. If the reported gains prove robust and attributable to the proposed methods rather than data or evaluation artifacts, the work would constitute a substantial advance toward inclusive, high-coverage MT. Strengths include the scale of human evaluation, the introduction of a toxicity benchmark for safety assessment across all languages, the human-centered framing via speaker interviews, and the commitment to open-sourcing. These elements provide concrete resources that could accelerate follow-on research on low-resource translation.
major comments (3)
- [Abstract] The headline claim of a 44% relative BLEU improvement is presented without any ablation that isolates the contribution of the Sparsely Gated MoE architecture and anti-overfitting training changes from the novel data-mining pipeline. Because the test set (Flores-200) is human-translated and the training data are mined from the same broad web sources, the numerical gain cannot yet be confidently attributed to the model innovations rather than to improved data quality or distributional overlap.
- [Abstract] The aggregate 44% BLEU figure is reported over more than 40,000 directions, yet no per-language variance, confidence intervals, or statistical significance tests for the relative improvement are referenced. Without these controls it is impossible to determine whether the headline number is driven by a small number of high-resource directions or reflects consistent gains on the low-resource languages that motivate the work.
- [Abstract] The toxicity benchmark is described as covering all Flores-200 languages and is used to assess translation safety, but the abstract supplies no information on the toxicity classifier, annotation protocol, or decision thresholds. This omission is load-bearing for the safety claims that accompany the performance numbers.
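The second comment asks for confidence intervals over the per-direction gains. A percentile bootstrap over per-direction BLEU deltas is one standard way to produce them; the deltas below are synthetic, not the paper's:

```python
import random

# Percentile bootstrap CI for the mean per-direction BLEU delta (new - old).
# The deltas are synthetic stand-ins for the >40,000 directions.

random.seed(0)
deltas = [random.gauss(2.0, 3.0) for _ in range(1000)]

def bootstrap_ci(values, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    means = []
    for _ in range(n_resamples):
        sample = random.choices(values, k=len(values))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(deltas)
assert lo < sum(deltas) / len(deltas) < hi
```

Reporting such an interval per resource tier (low vs. high) would directly answer whether the aggregate number is carried by a few high-resource directions.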
minor comments (1)
- [Abstract] The abstract introduces the model as a 'conditional compute model based on Sparsely Gated Mixture of Experts' without immediately clarifying the relationship between the two phrases; a single sentence linking the terms would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the abstract of our NLLB manuscript. We appreciate the recognition of the project's scale, human evaluation, toxicity benchmark, and open-sourcing. All major comments concern the abstract, which we will revise for greater precision and self-containment while preserving its summary nature. We respond point by point below.
Point-by-point responses
Referee: [Abstract] The headline claim of a 44% relative BLEU improvement is presented without any ablation that isolates the contribution of the Sparsely Gated MoE architecture and anti-overfitting training changes from the novel data-mining pipeline. Because the test set (Flores-200) is human-translated and the training data are mined from the same broad web sources, the numerical gain cannot yet be confidently attributed to the model innovations rather than to improved data quality or distributional overlap.
Authors: We agree that stronger isolation of contributions would be valuable. The manuscript presents the data-mining pipeline and the Sparsely Gated MoE model (with anti-overfitting changes) as complementary elements developed for the 200-language setting, with the 44% gain measured against prior SOTA systems lacking both. The full text contains separate sections detailing each and direct comparisons to prior work. We will revise the abstract to explicitly state that the reported improvement results from the integrated pipeline and to direct readers to the relevant sections for component-wise analysis. Adding exhaustive new ablations at this scale would require substantial additional compute; we therefore treat this as a partial revision focused on abstract clarity. revision: partial
Referee: [Abstract] The aggregate 44% BLEU figure is reported over more than 40,000 directions, yet no per-language variance, confidence intervals, or statistical significance tests for the relative improvement are referenced. Without these controls it is impossible to determine whether the headline number is driven by a small number of high-resource directions or reflects consistent gains on the low-resource languages that motivate the work.
Authors: The full manuscript and appendices report per-language BLEU scores, variance across directions, and human evaluation results demonstrating that gains are largest and most consistent for low-resource languages. Statistical support comes from the scale of the Flores-200 human evaluations. We will revise the abstract to note that the aggregate figure reflects consistent improvements on low-resource directions, as validated by the detailed per-direction and human assessments presented in the body of the paper. revision: yes
Referee: [Abstract] The toxicity benchmark is described as covering all Flores-200 languages and is used to assess translation safety, but the abstract supplies no information on the toxicity classifier, annotation protocol, or decision thresholds. This omission is load-bearing for the safety claims that accompany the performance numbers.
Authors: We agree the abstract should be more self-contained on this point. The manuscript details a multilingual toxicity classifier fine-tuned on human-annotated data collected from native speakers for each language, together with the annotation protocol and thresholds calibrated via human validation. We will revise the abstract to include a concise description of the toxicity benchmark methodology, noting the classifier, native-speaker annotation, and coverage of all 200 languages. revision: yes
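The response above mentions decision thresholds "calibrated via human validation." One minimal sketch of that idea, not the paper's actual procedure: pick the smallest classifier threshold whose precision against human toxicity labels meets a target. All data and names here are hypothetical.

```python
# Hypothetical threshold calibration: choose the lowest toxicity-score cutoff
# whose precision on human-labeled validation data reaches a target.
# Scores and labels are synthetic illustrations.

def calibrate_threshold(scores, labels, target_precision=0.9):
    """Return the smallest threshold whose precision >= target, else None."""
    for t in sorted(set(scores)):
        flagged = [label for s, label in zip(scores, labels) if s >= t]
        if flagged and sum(flagged) / len(flagged) >= target_precision:
            return t
    return None

scores = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95]   # classifier toxicity scores
labels = [0,   0,   1,   0,   1,   1]      # 1 = human-judged toxic
print(calibrate_threshold(scores, labels))  # 0.9
```

Making some such procedure explicit per language in the abstract would address the referee's concern that the safety claim currently rests on unstated calibration choices.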
Circularity Check
No circularity: empirical benchmark result with no self-referential derivations
Full rationale
The paper reports an engineering achievement: a sparsely-gated MoE model trained on newly mined low-resource data, with listed anti-overfitting changes, evaluated on the human-translated Flores-200 benchmark. The 44% relative BLEU figure is a direct measurement on held-out test data rather than a prediction derived from fitted parameters or prior self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed improvement to the inputs by construction. The work is self-contained as an empirical report; any self-citations are incidental and not load-bearing for the central numerical claim.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of experts and expert capacity
- regularization coefficients and training schedule
axioms (2)
- domain assumption: Mined parallel data for low-resource languages is sufficiently clean and representative for supervised training.
- domain assumption: Human raters and the toxicity classifier provide reliable safety signals across all 200 languages.
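The first free parameter, expert capacity, caps how many tokens each expert may process per batch. A sketch of the common capacity-factor convention; the numbers are illustrative, not the paper's settings:

```python
import math

# Expert capacity under the usual capacity-factor convention: tokens that
# overflow an expert's budget are typically dropped or passed through the
# residual connection. The arguments below are illustrative.

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float = 1.25) -> int:
    """Maximum tokens each expert accepts in one batch."""
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)

print(expert_capacity(4096, 128))  # 40
```

Both the capacity factor and the number of experts trade load balance against dropped tokens, which is why they belong in the free-parameter ledger.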
Forward citations
Cited by 30 Pith papers
- Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
  New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
- Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
  Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
- Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization
  Encoder-decoder transformers are characterized by a temporal logic extending propositional logic with a counting global modality on the encoder and a past modality on the decoder, equivalently via distributed automata.
- Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
  Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
- STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
  STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cos...
- MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
  MORPHOGEN is a new multilingual benchmark for testing LLMs on gender-aware morphological generation via rewriting first-person sentences to the opposite gender in French, Arabic, and Hindi.
- Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models
  Litmus (Re)Agent, a structured agentic system, outperforms baselines in predicting multilingual model performance from incomplete evidence on a new controlled benchmark.
- One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging
  Merging fine-tuned models for multilingual translation fails because fine-tuning redistributes language-specific neurons rather than sharpening them, increasing representational divergence in output-generating layers.
- Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
  Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
- ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset
  ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to transla...
- MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
  MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...
- Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation
  Attention sinks in NLLB-200 cross-attention cause non-content tokens to dominate 83-91% of mass, halving apparent content similarity; content filtering recovers linguistic signals like language clustering and mode dif...
- RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
  RouteLMT learns to route MT requests to large or small LLMs by predicting marginal quality gain from small-model token representations, yielding a better quality-budget Pareto frontier than baselines.
- COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
  COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
- MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
  MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and em...
- Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service
  GeoMark decouples local watermark triggering from centralized ownership attribution using geometry-separated anchors and adaptive neighborhoods to improve robustness against paraphrasing, dimension changes, and cluste...
- Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection
  A metadata-conditioned mT5 model trained on rule-augmented dialectal Arabic data produces translations that better match intended regional varieties than high-resource baselines, despite lower BLEU scores.
- CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training
  CLEAR is a reverse-training loss that improves cross-lingual retrieval performance by up to 15% in low-resource languages while minimizing degradation in English by using English as an alignment bridge.
- YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset
  YoNER supplies a multi-domain Yoruba NER corpus of 5k sentences plus OyoBERT, showing African-centric models beat multilingual baselines in-domain while cross-domain performance drops sharply for blogs and movies.
- Toward Culturally Grounded Natural Language Processing
  Culturally grounded NLP must shift from isolated language benchmarks to modeling communicative ecologies that encompass institutions, scripts, domains, modalities, and communities.
- CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
  CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
- When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP
  Data augmentation via LLMs and back-translation produces task-specific effects on NER and POS tagging for Hausa and Fongbe, with no consistent gains over baseline and opposite outcomes across tasks for the same synthe...
- Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
  Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
- Testing the Assumptions of Active Learning for Translation Tasks with Few Samples
  Informativeness and diversity of samples selected by active learning show no correlation with test performance on translation tasks using few samples; ordering and pre-training effects dominate instead.
- Multilingual E5 Text Embeddings: A Technical Report
  Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
- LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
  Qwen2.5-3B was continued-pretrained and then fine-tuned with rsLoRA r256 on Sardinian data to reach 28.5 BLEU into the language, outperforming full fine-tuning and other LoRA variants.
- MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen-TF-IDF Hybrid and Meta-Ensemble Learning
  MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
- Phoenix-VL 1.5 Medium Technical Report
  Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying comp...
- AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering
  A modular XR platform integrates Whisper, NLLB, AWS Polly, RoBERTa, flan-t5, and MediaPipe to deliver real-time multilingual and International Sign support for education, with benchmarks showing AWS Polly's low latenc...
- A Survey of Large Language Models
  This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.