Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

Kenji Hilasaca; Nouran Khallaf; Serge Sharoff

arxiv: 2605.09476 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.AI

Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

Kenji Hilasaca , Nouran Khallaf , Serge Sharoff This is my paper

Pith reviewed 2026-05-12 02:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords text simplificationsentence alignmentmultilingual corporacomparable corporacrowdsourcingdataset construction

0 comments

The pith

Crowd-sourced simplifications from comparable documents can be aligned at sentence level to create usable multilingual text simplification data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an experimental method for turning document-level simplifications collected from comparable corpora into sentence-aligned pairs. The process covers five languages and produces a dataset intended for both training and testing simplification systems. The authors describe alignment mechanisms that operate without requiring originally parallel texts. The resulting pairs are released publicly so other researchers can use them directly.

Core claim

Sentence-level alignment applied to crowd-sourced simplifications collected from comparable document pairs yields a clean, high-quality multilingual simplification corpus covering Catalan, English, French, Italian and Spanish that supports both model training and evaluation.

What carries the argument

Sentence-level alignment mechanisms that map simplifications across comparable (not parallel) documents.

If this is right

The aligned pairs can be used directly to train simplification models for the five languages.
The same alignment approach can be applied to build similar resources for additional languages.
Evaluation of simplification systems can now draw on consistent multilingual test data instead of English-only resources.
Document-level comparable corpora become a viable source for creating training data when true parallel texts are scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The alignment step may reduce the cost of creating simplification data compared with collecting new parallel texts from scratch.
Similar alignment techniques could be tested on other generation tasks that start from comparable rather than parallel sources.
If the pairs prove high quality, they could support studies of how simplification strategies differ across the five languages.

Load-bearing premise

Crowd-sourced simplifications gathered from comparable documents can be aligned at the sentence level to produce pairs that are both accurate and low in noise.

What would settle it

Manual inspection of a random sample of the released pairs revealing frequent mismatches or simplifications that do not preserve core meaning.

Figures

Figures reproduced from arXiv: 2605.09476 by Kenji Hilasaca, Nouran Khallaf, Serge Sharoff.

**Figure 1.** Figure 1: Plotting Strict F1-scores across cosine similarity thresholds (τ ) for the five languages for each embedding method. The black dots indicate the optimal threshold that maximizes the F1-score for each language and for each method. Notice the overall height superiority of LaBSE, the rightward shift of the BGE peaks reflecting its similarity distribution, and the compressed performance of SONAR. bility across… view at source ↗

read the original abstract

Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on the collection and processing of crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems across multiple languages (Catalan, English, French, Italian and Spanish). We report mechanisms for sentence-level alignment from document-level data. The resulting dataset of the aligned sentence pairs is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper releases a new sentence-aligned simplification corpus for five languages from crowd-sourced comparable data, but the description gives almost no numbers on alignment quality or dataset properties.

read the letter

This paper mainly gives us a new publicly available sentence-aligned simplification corpus covering Catalan, English, French, Italian, and Spanish. The authors gathered crowd-sourced simplifications from comparable documents and worked out ways to align them at the sentence level. The new part is extending this kind of data collection to those languages and releasing the results. Most simplification work stays in English, so this fills a practical gap for multilingual efforts. Describing the alignment mechanisms from document-level data is straightforward and could help others doing similar work. They handle the data release well by making it public, which is the kind of thing that lets the community build on it. The soft spots come from the lack of supporting numbers. The abstract talks about high-quality corpora but does not include any alignment accuracy scores, dataset statistics, or model evaluations. Without those details, it is difficult to assess how clean the pairs really are or how much noise they contain. The core idea that crowd-sourced data from comparable sources can produce good simplification pairs is reasonable, but it stays untested in the provided description. This paper is aimed at researchers who need training data for text simplification in multiple languages. Someone looking for theoretical advances or strong empirical results will not find them here. A reader focused on resources and data construction will get more out of it. It deserves a serious referee because data papers like this can be valuable if the quality holds up. I would recommend sending it to peer review, but the reviewers should push for quantitative checks on the alignments and perhaps some simple baseline results to show the data works.

Referee Report

2 major / 2 minor

Summary. The paper reports an experimental study on collecting crowd-sourced text simplifications from comparable (non-parallel) corpora in five languages (Catalan, English, French, Italian, Spanish), describes sentence-level alignment mechanisms applied to document-level data, and releases the resulting aligned sentence pairs as a public dataset intended for training and evaluating multilingual text simplification systems.

Significance. If the alignment process yields low-noise, high-fidelity simplification pairs, the released multilingual dataset would address a documented scarcity of resources beyond English and support both monolingual and cross-lingual simplification research. The public availability itself is a clear positive for reproducibility.

major comments (2)

[Abstract] Abstract: The manuscript repeatedly describes the output as 'high-quality' sentence-aligned corpora suitable for training and testing, yet provides no quantitative evidence of alignment accuracy (e.g., precision/recall against gold alignments), simplification quality metrics, inter-annotator agreement, or even basic dataset statistics such as number of pairs per language or average compression ratio. Without these, the central claim that the resource is high-quality cannot be evaluated.
[Alignment mechanism description] Section describing alignment (likely §3 or §4): The sentence-alignment procedure from comparable document-level data is presented as a key contribution, but the text supplies no validation experiments, error analysis, or comparison against standard aligners (e.g., Vecalign or Gale-Church). This leaves open whether the method reliably filters non-parallel content or introduces systematic noise that would affect downstream model training.

minor comments (2)

Add a table (or appendix) with per-language statistics: number of documents, aligned pairs, average sentence lengths before/after simplification, and any filtering thresholds used.
[Abstract] The abstract states the dataset 'is publicly available' but the manuscript should include an explicit URL, license, and citation format in the main text or a dedicated data-availability statement.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative support for our claims about dataset quality and alignment reliability. We address each major comment below, indicating where we agree revisions are warranted.

read point-by-point responses

Referee: [Abstract] Abstract: The manuscript repeatedly describes the output as 'high-quality' sentence-aligned corpora suitable for training and testing, yet provides no quantitative evidence of alignment accuracy (e.g., precision/recall against gold alignments), simplification quality metrics, inter-annotator agreement, or even basic dataset statistics such as number of pairs per language or average compression ratio. Without these, the central claim that the resource is high-quality cannot be evaluated.

Authors: We agree that the abstract and title use the term 'high-quality' without supporting quantitative metrics, which weakens the central claim. The manuscript emphasizes the crowd-sourced collection process and alignment procedure from comparable documents, with the public release intended to enable further evaluation by the community. Basic statistics (e.g., pair counts per language) are available in the released dataset but were not detailed in the paper. We will revise the abstract to qualify the claim (e.g., 'aligned sentence pairs derived from crowd-sourced simplifications') and add a dedicated table or subsection reporting dataset statistics including number of pairs, average lengths, and compression ratios. However, no gold-standard alignments or inter-annotator agreement were collected during the project, so we cannot provide precision/recall or IAA figures without new annotation. revision: partial
Referee: [Alignment mechanism description] Section describing alignment (likely §3 or §4): The sentence-alignment procedure from comparable document-level data is presented as a key contribution, but the text supplies no validation experiments, error analysis, or comparison against standard aligners (e.g., Vecalign or Gale-Church). This leaves open whether the method reliably filters non-parallel content or introduces systematic noise that would affect downstream model training.

Authors: The alignment procedure is described in Section 3 as a hybrid lexical-semantic approach applied to document-level comparable data. We did not include validation experiments or direct comparisons to aligners such as Vecalign or Gale-Church, as the paper's primary focus is the overall data collection pipeline and public release rather than a benchmarking study of alignment algorithms. We acknowledge this leaves the reliability open to question and will add a limitations subsection discussing potential noise sources in the alignment step along with a small-scale manual error analysis on sampled pairs. Comprehensive comparative experiments would require substantial additional resources beyond the scope of the original work. revision: partial

standing simulated objections not resolved

Precision/recall metrics for alignment accuracy against gold standards, as no gold alignments were annotated during data collection.
Inter-annotator agreement for the crowd-sourced simplifications, which was not computed in the original project.

Circularity Check

0 steps flagged

Empirical data collection with no derivations or predictions

full rationale

The paper is a report of an experimental study involving crowd-sourced data collection from comparable corpora, sentence alignment mechanisms, and public dataset release across five languages. No equations, fitted parameters, predictions, or theoretical derivations are present that could reduce to inputs by construction. The work is self-contained factual reporting of a process, with no self-citation chains or ansatzes invoked for load-bearing claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical corpus-construction study with no mathematical derivations. No free parameters, axioms, or invented entities are introduced or required.

pith-pipeline@v0.9.0 · 5407 in / 1224 out tokens · 51007 ms · 2026-05-12T02:58:02.959018+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We report mechanisms for sentence-level alignment from document-level data... SentAlign... LaBSE... BGE-M3... SONAR... cosine similarity threshold (τ)
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Jcost uniqueness and phi-ladder derivations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1]

Data needed for training automatic text sim- plification tools are based on aligned sen- tences

Introduction Automatic text simplification plays a crucial role in improving the accessibility and com- prehensibility of written information for diverse audiences, including language learners and readers with limited literacy (Saggion, 2017). Data needed for training automatic text sim- plification tools are based on aligned sen- tences. This alignment a...

work page 2017
[2]

Related studies Early work on machine translation highlighted both the value and the limitations of domain- specific resources such as the European Par- liament corpus (Koehn, 2005) and the United Nations corpus (Ziemski et al., 2016). The lim- itations on the amount and diversity of texts motivated large-scale mining from comparable corpora, a line of re...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[3]

First, we evalu- ate traditional surface-level baselines against modern semantic embedding methods using a manually annotated gold standard

Methodology Our study proposes a two-phase methodology for identifying parallel sentences in document- aligned simplification corpora. First, we evalu- ate traditional surface-level baselines against modern semantic embedding methods using a manually annotated gold standard. This eval- uation phase allows us to identify the best- performing embedding mode...

work page 1993
[4]

Heuristic Pre-selection:SentAlign re- duces the search space by generating candidate alignments using the Gale and Church (1993) algorithm, which relies on character length ratios

work page 1993
[5]

Those exceeding a high-confidence cosine similarity thresh- old becomeanchors

Semantic Anchoring:Candidates are validated using the chosen embedding model (Section 3.3). Those exceeding a high-confidence cosine similarity thresh- old becomeanchors. This establishes fixed points in the document map that par- tition the text into smaller segments

work page
[6]

Global Optimization:The algorithm aligns segments between anchors using Dijkstra’s shortest path algorithm, with costs derived from the cosine similarity matrix. We configure this stage to prior- itize the simplified document as the ref- erence, enabling an asymmetric search that retrieves the closest semantic equiva- lent(s) in the complex document for e...

work page 2022
[7]

children and anyone seeking easy-to-read content

Evaluation Setup We start with the initial corpus, which has been crawled from Vikidia2, a website that maintains Wikipedia-style content aimed at “children and anyone seeking easy-to-read content”. For each Vikidia document, we added the corre- sponding Wikipedia article in the same lan- guage to form comparable document pairs. Stub articles (with little...

work page 2020
[8]

topical illusion

Results and Discussion 5.1. Quantitative Analysis Table 2 presents the comprehensive alignment performance across all five languages. We report the two surface-feature baselines along- side the SentAlign framework instantiated with three embedding spaces (LaBSE, BGE-M3, and SONAR). For each neural model, we in- clude both raw output (without threshold fil...

work page arXiv 2020
[9]

Conclusions This study provides the first systematic com- parison of semantic embedding spaces for sentence-level alignment in multilingual text simplification, a gap previously unaddressed in the literature. We demonstrate that LaBSE’s translation-ranking objective transfers robustly to monolingual paraphrase detection across Romance languages, while BGE...

work page
[10]

First, our strict thresholding strategy (τ) prioritizes precision over recall, discarding approximately 95% of the original sentences

Limitations While our pipeline successfully extracts high- precision parallel corpora, it has notable limi- tations. First, our strict thresholding strategy (τ) prioritizes precision over recall, discarding approximately 95% of the original sentences. Although this ensures a noise-free dataset, it inevitably filters out valid but highly abstract simplific...

work page
[11]

We utilize publicly available, crowdsourced data from Wikipedia and Vikidia, strictly adhering to their Creative Commons (CC-BY -SA) licenses

Ethics Statement This research complies with standard ethical guidelines for NLP . We utilize publicly available, crowdsourced data from Wikipedia and Vikidia, strictly adhering to their Creative Commons (CC-BY -SA) licenses. Given the encyclope- dic nature of the texts, the dataset contains no personally identifiable information (PII) or sensitive person...

work page
[12]

101132431 (iDEM Project)

Acknowledgments This document is part of a project that has received funding from the European Union’s Horizon Europe research and innovation pro- gram under Grant Agreement No. 101132431 (iDEM Project). The University of Leeds was funded by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (Grant Agreement No. 1...

work page
[13]

Bibliographical References Sisay Fissaha Adafre and Maarten de Rijke

work page
[14]

Sonar: Sentence-level multimodal and language-agnostic representations

Finding similar sentences across mul- tiple languages in Wikipedia. InProceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources. Fernando Alva-Manchego, Louis Martin, An- toine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. ASSET: A dataset for tuning and evaluation of sen- tence simplification models with mul...

work page arXiv 2020
[15]

Michael J Ryan, Tarek Naous, and Wei Xu

A comparative study of sentence alignment methods for Spanish text simpli- fication.Language Resources and Evalua- tion, 60(2):29. Michael J Ryan, Tarek Naous, and Wei Xu

work page
[16]

InProceedings of the 61st Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 4898–4927, Toronto, Canada

Revisiting non-English text simpli- fication: A unified multilingual benchmark. InProceedings of the 61st Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 4898–4927, Toronto, Canada. Association for Computational Linguistics. Horacio Saggion. 2017.Automatic text sim- plification. Synthesis Lectures on Human ...

work page 2017
[17]

Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum

WikiMatrix: Mining 135m parallel sentences in 1620 language pairs from Wikipedia.arXiv preprint arXiv:1907.05791. Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum. 2023.Building and Using Comparable Corpora for Multilingual Natu- ral Language Processing. Synthesis Lec- tures on Human Language Technologies. Springer Nature. Sanja Štajner, Marc Franco-S...

work page arXiv 1907
[18]

InProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pages 641–649

C-pack: Packed resources for general chinese embeddings. InProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pages 641–649. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Weinberger, and Y oav Artzi. 2020. BERTScore: Evaluating text generation with BERT. InInternational Conference on Learning ...

work page 2020

[1] [1]

Data needed for training automatic text sim- plification tools are based on aligned sen- tences

Introduction Automatic text simplification plays a crucial role in improving the accessibility and com- prehensibility of written information for diverse audiences, including language learners and readers with limited literacy (Saggion, 2017). Data needed for training automatic text sim- plification tools are based on aligned sen- tences. This alignment a...

work page 2017

[2] [2]

Related studies Early work on machine translation highlighted both the value and the limitations of domain- specific resources such as the European Par- liament corpus (Koehn, 2005) and the United Nations corpus (Ziemski et al., 2016). The lim- itations on the amount and diversity of texts motivated large-scale mining from comparable corpora, a line of re...

work page internal anchor Pith review Pith/arXiv arXiv 2005

[3] [3]

First, we evalu- ate traditional surface-level baselines against modern semantic embedding methods using a manually annotated gold standard

Methodology Our study proposes a two-phase methodology for identifying parallel sentences in document- aligned simplification corpora. First, we evalu- ate traditional surface-level baselines against modern semantic embedding methods using a manually annotated gold standard. This eval- uation phase allows us to identify the best- performing embedding mode...

work page 1993

[4] [4]

Heuristic Pre-selection:SentAlign re- duces the search space by generating candidate alignments using the Gale and Church (1993) algorithm, which relies on character length ratios

work page 1993

[5] [5]

Those exceeding a high-confidence cosine similarity thresh- old becomeanchors

Semantic Anchoring:Candidates are validated using the chosen embedding model (Section 3.3). Those exceeding a high-confidence cosine similarity thresh- old becomeanchors. This establishes fixed points in the document map that par- tition the text into smaller segments

work page

[6] [6]

Global Optimization:The algorithm aligns segments between anchors using Dijkstra’s shortest path algorithm, with costs derived from the cosine similarity matrix. We configure this stage to prior- itize the simplified document as the ref- erence, enabling an asymmetric search that retrieves the closest semantic equiva- lent(s) in the complex document for e...

work page 2022

[7] [7]

children and anyone seeking easy-to-read content

Evaluation Setup We start with the initial corpus, which has been crawled from Vikidia2, a website that maintains Wikipedia-style content aimed at “children and anyone seeking easy-to-read content”. For each Vikidia document, we added the corre- sponding Wikipedia article in the same lan- guage to form comparable document pairs. Stub articles (with little...

work page 2020

[8] [8]

topical illusion

Results and Discussion 5.1. Quantitative Analysis Table 2 presents the comprehensive alignment performance across all five languages. We report the two surface-feature baselines along- side the SentAlign framework instantiated with three embedding spaces (LaBSE, BGE-M3, and SONAR). For each neural model, we in- clude both raw output (without threshold fil...

work page arXiv 2020

[9] [9]

Conclusions This study provides the first systematic com- parison of semantic embedding spaces for sentence-level alignment in multilingual text simplification, a gap previously unaddressed in the literature. We demonstrate that LaBSE’s translation-ranking objective transfers robustly to monolingual paraphrase detection across Romance languages, while BGE...

work page

[10] [10]

First, our strict thresholding strategy (τ) prioritizes precision over recall, discarding approximately 95% of the original sentences

Limitations While our pipeline successfully extracts high- precision parallel corpora, it has notable limi- tations. First, our strict thresholding strategy (τ) prioritizes precision over recall, discarding approximately 95% of the original sentences. Although this ensures a noise-free dataset, it inevitably filters out valid but highly abstract simplific...

work page

[11] [11]

We utilize publicly available, crowdsourced data from Wikipedia and Vikidia, strictly adhering to their Creative Commons (CC-BY -SA) licenses

Ethics Statement This research complies with standard ethical guidelines for NLP . We utilize publicly available, crowdsourced data from Wikipedia and Vikidia, strictly adhering to their Creative Commons (CC-BY -SA) licenses. Given the encyclope- dic nature of the texts, the dataset contains no personally identifiable information (PII) or sensitive person...

work page

[12] [12]

101132431 (iDEM Project)

Acknowledgments This document is part of a project that has received funding from the European Union’s Horizon Europe research and innovation pro- gram under Grant Agreement No. 101132431 (iDEM Project). The University of Leeds was funded by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (Grant Agreement No. 1...

work page

[13] [13]

Bibliographical References Sisay Fissaha Adafre and Maarten de Rijke

work page

[14] [14]

Sonar: Sentence-level multimodal and language-agnostic representations

Finding similar sentences across mul- tiple languages in Wikipedia. InProceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources. Fernando Alva-Manchego, Louis Martin, An- toine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. ASSET: A dataset for tuning and evaluation of sen- tence simplification models with mul...

work page arXiv 2020

[15] [15]

Michael J Ryan, Tarek Naous, and Wei Xu

A comparative study of sentence alignment methods for Spanish text simpli- fication.Language Resources and Evalua- tion, 60(2):29. Michael J Ryan, Tarek Naous, and Wei Xu

work page

[16] [16]

InProceedings of the 61st Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 4898–4927, Toronto, Canada

Revisiting non-English text simpli- fication: A unified multilingual benchmark. InProceedings of the 61st Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 4898–4927, Toronto, Canada. Association for Computational Linguistics. Horacio Saggion. 2017.Automatic text sim- plification. Synthesis Lectures on Human ...

work page 2017

[17] [17]

Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum

WikiMatrix: Mining 135m parallel sentences in 1620 language pairs from Wikipedia.arXiv preprint arXiv:1907.05791. Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum. 2023.Building and Using Comparable Corpora for Multilingual Natu- ral Language Processing. Synthesis Lec- tures on Human Language Technologies. Springer Nature. Sanja Štajner, Marc Franco-S...

work page arXiv 1907

[18] [18]

InProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pages 641–649

C-pack: Packed resources for general chinese embeddings. InProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pages 641–649. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Weinberger, and Y oav Artzi. 2020. BERTScore: Evaluating text generation with BERT. InInternational Conference on Learning ...

work page 2020