Unsupervised Machine Translation Using Monolingual Corpora Only

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato

classification cs.CL cs.AI

keywords parallel, corpora, model, without, data, datasets, even, language

Machine translation has achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet these approaches still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate without any parallel data at all. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct sentences in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using even a single parallel sentence at training time.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
cs.SE 2020-09 conditional novelty 7.0

CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
cs.CL 2018-08 accept novelty 7.0

SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.

LLMs Corrupt Your Documents When You Delegate
cs.CL 2026-04 unverdicted novelty 6.0

LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.