pith. sign in

arxiv: 1905.08212 · v1 · pith:IM7IQZBSnew · submitted 2019-05-20 · 💻 cs.CL

Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation

classification 💻 cs.CL
keywords datalow-resourcemultilingualsamplingtargettrainingconditionedimprove
0
0 comments X
read the original abstract

To improve low-resource Neural Machine Translation (NMT) with multilingual corpora, training on the most related high-resource language only is often more effective than using all data available (Neubig and Hu, 2018). However, it is possible that an intelligent data selection strategy can further improve low-resource NMT with data from other auxiliary languages. In this paper, we seek to construct a sampling distribution over all multilingual data, so that it minimizes the training loss of the low-resource language. Based on this formulation, we propose an efficient algorithm, Target Conditioned Sampling (TCS), which first samples a target sentence, and then conditionally samples its source sentence. Experiments show that TCS brings significant gains of up to 2 BLEU on three of four languages we test, with minimal training overhead.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.