WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

Claire Cardie; Esin Durmus; Faisal Ladhak; Kathleen McKeown

arxiv: 2010.03093 · v1 · pith:6OFKSPTInew · submitted 2020-10-07 · 💻 cs.CL

Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown

classification 💻 cs.CL

keywords: summarization, abstractive, dataset, article, cross-lingual, crosslingual, further, how-to

We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct crosslingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost efficient during inference.
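
The image-based alignment described in the abstract can be illustrated with a minimal sketch. It assumes each how-to step is a record with a hypothetical `image` key (an image identifier shared by all language versions of the step) plus `text` and `summary` fields. These field names, the helper functions, and the toy data are illustrative assumptions, not the dataset's actual schema or the authors' code.

```python
def align_steps_by_image(src_steps, tgt_steps):
    """Pair steps from two language versions of an article that share an image.

    Assumption: the same image identifier appears in both versions of a step.
    """
    tgt_by_image = {step["image"]: step for step in tgt_steps}
    pairs = []
    for step in src_steps:
        match = tgt_by_image.get(step["image"])
        if match is not None:  # steps without a counterpart are skipped
            pairs.append((step, match))
    return pairs


def build_crosslingual_example(pairs):
    """Join source-language step texts and target-language step summaries."""
    article = " ".join(src["text"] for src, _ in pairs)
    summary = " ".join(tgt["summary"] for _, tgt in pairs)
    return article, summary


# Hypothetical toy data: a Spanish article and its English counterpart.
spanish_steps = [
    {"image": "step1.jpg", "text": "Llena una olla con agua y ponla al fuego.",
     "summary": "Hierve agua."},
    {"image": "step2.jpg", "text": "Coloca la bolsita en la taza y vierte el agua.",
     "summary": "Agrega el té."},
    {"image": "solo_es.jpg", "text": "Paso sin equivalente en inglés.",
     "summary": "Paso extra."},
]
english_steps = [
    {"image": "step2.jpg", "text": "Put the tea bag in the cup and pour the water.",
     "summary": "Add the tea."},
    {"image": "step1.jpg", "text": "Fill a pot with water and heat it.",
     "summary": "Boil water."},
]

pairs = align_steps_by_image(spanish_steps, english_steps)
article, summary = build_crosslingual_example(pairs)
print(article)
print(summary)
```

Here the Spanish article text is paired with an English summary, in source-step order, and the step that has no shared image is dropped. Matching on an exact identifier is the simplest possible choice for the sketch; a real pipeline may need a more robust way to decide that two images are the same.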

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.