A Large Parallel Corpus of Full-Text Scientific Articles

Felipe Soares; Karin Becker; Viviane Pereira Moreira

arXiv: 1905.01852 · v1 · pith:7QFWEOWUnew · submitted 2019-05-06 · cs.CL

classification: cs.CL

keywords: articles, corpus, parallel, language, scielo, scientific, aligned, article

The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text articles are available in more than one language, making it a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, and also for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related work on scientific articles. Sentence alignment was also manually evaluated, with an average of 98.8% of sentences correctly aligned across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.
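Since the corpus is distributed in TMX, an XML format in which each translation unit (`tu`) holds one variant (`tuv`) per language with the text in a `seg` element, reading sentence pairs could look roughly like the sketch below. This is a minimal illustration under assumptions, not the authors' tooling: the function name `read_tmx_pairs` is hypothetical, and the language codes `en` and `pt` are guesses, since the abstract does not state which codes the released files use.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="en", tgt_lang="pt"):
    """Yield (source, target) sentence pairs from a TMX file path or file object.

    The language codes are assumptions; check the codes used in the actual files.
    """
    # iterparse streams the file, so a large corpus need not fit in memory.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # Normalise e.g. "pt-BR" to "pt" and flatten inline markup.
                segs[lang.lower().split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        # Free the processed translation unit.
        elem.clear()
```

Calling `read_tmx_pairs("corpus.tmx", "en", "es")` on a local file (the path here is a placeholder) would stream English-Spanish pairs one at a time; units that lack either language are skipped.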

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
cs.CL 2026-05 conditional novelty 4.0

Development of domain-specific scientific corpora for English-Spanish, English-French, and English-Portuguese and their application to fine-tuning NMT models.