CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs

Abhik Bhattacharjee; Rifat Shahriyar; Tahmid Hasan; Wasi Uddin Ahmad; Yong-Bin Kang; Yuan-Fang Li

arxiv: 2112.08804 · v3 · pith:SW7M3AJSnew · submitted 2021-12-16 · 💻 cs.CL


classification 💻 cs.CL

keywords cross-lingual · summarization · crosssum · dataset · language · lase · rouge · evaluation


We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset and perform a controlled human evaluation to validate its quality. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also introduce LaSE, an embedding-based metric for automatically evaluating model-generated summaries. LaSE is strongly correlated with ROUGE and, unlike ROUGE, can be reliably measured even in the absence of references in the target language. Performance on ROUGE and LaSE indicates that our proposed model consistently outperforms baseline models. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first that is not centered around English. We are releasing the dataset, training and evaluation scripts, and models to spur future research on cross-lingual summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum
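
To make the idea of an embedding-based metric concrete, here is a minimal sketch of scoring a generated summary against a reference by the cosine similarity of their embedding vectors. It is an illustration only and not the definition of LaSE, which the abstract does not spell out. The function names `cosine_similarity` and `embedding_score` are hypothetical, and the toy vectors stand in for the output of a multilingual sentence encoder, which is assumed rather than shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no direction
    return dot / (norm_a * norm_b)


def embedding_score(summary_vec: Sequence[float],
                    reference_vec: Sequence[float]) -> float:
    """Hypothetical similarity score in [0, 1]: negative cosines are clipped to 0."""
    return max(0.0, cosine_similarity(summary_vec, reference_vec))


# Toy vectors (assumption): in practice these would come from a multilingual
# sentence encoder applied to the generated summary and to the reference.
generated = [0.2, 0.7, 0.1]
reference = [0.25, 0.65, 0.05]
score = embedding_score(generated, reference)
```

If the encoder maps text from different languages into a shared vector space, a score of this kind does not require the summary and the reference to be in the same language, which is the property the abstract contrasts with ROUGE.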


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
cs.CL · 2026-04 · unverdicted · novelty 5.0

A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over str...