In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), pages 114–123, Virtual.
1 Pith paper cites this work:
BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation
The paper releases a new 2.78M-sentence English-Marathi parallel corpus and reports that corpus-level deduplication produces the largest single gain in downstream NMT performance among tested preprocessing steps.