Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Krzysztof Marasek; Krzysztof Wołk

arXiv: 1509.08881 · v1 · pith:EUUMA6VFnew · submitted 2015-09-29 · cs.CL · cs.IR · stat.ML

keywords: comparable, parallel, corpora, pairs, sentence, sentences, building, data

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences by filtering them out from noisy or merely comparable sentence pairs. We describe our implementation of a specialized tool for this task, as well as the training and adaptation of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.
