BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

Abhi Mehta; Akshita Bhasin; Anushka Yadav; Michael Tiemann; Param Thakkar; Shrinivas Khedkar

arxiv: 2605.27050 · v1 · pith:H3XJJV2Enew · submitted 2026-05-26 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-29 18:10 UTC · model grok-4.3

keywords: machine translation · low-resource languages · parallel corpus · data deduplication · English-Marathi · NMT · LoRA · data preprocessing

The pith

Corpus-level deduplication delivers the largest quality gain for English-Marathi neural machine translation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BhashaSetu, a 2.78 million sentence pair English-Marathi dataset drawn from news, politics, healthcare, literature and culture. It shows through ablations that cleaning duplicates at the corpus level improves translation metrics more than other preprocessing choices. A sympathetic reader would care because this points to a simple, inexpensive way to raise performance for languages with limited high-quality data. The work also releases the dataset with linguistic enrichments like stems and lemmas to support further research on morphologically complex languages.

Core claim

BhashaSetu provides 2.78 million English-Marathi sentence pairs enriched with stemmed and lemmatized forms. When used to fine-tune the NLLB-200-distilled-600M model via LoRA, the dataset yields competitive translation results across BLEU, spBLEU, chrF++ and TER. The central discovery is that omitting corpus-level deduplication causes the largest drop in performance, specifically 1.17 BLEU and 2.21 chrF++, establishing cross-source deduplication as the dominant preprocessing factor for this low-resource setting.

What carries the argument

Corpus-level deduplication of sentence pairs collected from heterogeneous public sources, shown through ablation to be the dominant factor improving downstream NMT quality.
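
As a rough illustration of what corpus-level (cross-source) deduplication can look like, the sketch below pools pairs from several sources and keeps the first occurrence of each normalized (English, Marathi) pair. The normalization choices (NFKC, whitespace collapsing, lowercasing the English side) and the names `normalize`, `pair_key` and `dedup_pairs` are assumptions made for illustration; this summary does not describe the paper's actual procedure, and the toy sentences are made up.

```python
import hashlib
import unicodedata


def normalize(text: str, lowercase: bool = False) -> str:
    """Canonicalize Unicode and whitespace so trivially different copies collide."""
    text = unicodedata.normalize("NFKC", text)
    text = " ".join(text.split())
    return text.lower() if lowercase else text


def pair_key(en: str, mr: str) -> str:
    """Hash of the normalized pair; a fixed-size key keeps the seen-set small."""
    joined = normalize(en, lowercase=True) + "\t" + normalize(mr)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def dedup_pairs(sources):
    """Pool (en, mr) pairs from all sources and keep the first copy of each."""
    seen = set()
    kept = []
    for name, pairs in sources.items():
        for en, mr in pairs:
            key = pair_key(en, mr)
            if key not in seen:
                seen.add(key)
                kept.append((name, en, mr))
    return kept


# Toy input (hypothetical): the same pair appears in two sources.
sources = {
    "news": [("Hello world.", "नमस्कार जग."), ("Good  morning.", "शुभ सकाळ.")],
    "culture": [("hello world.", "नमस्कार जग."), ("Thank you.", "धन्यवाद.")],
}
deduped = dedup_pairs(sources)
```

Deduplicating within each source separately would miss the repeated pair here, which is why the key set is shared across sources. Exact-match hashing of this kind does not catch near-duplicates; that would need a fuzzier comparison.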

If this is right

For morphologically rich low-resource languages, data hygiene steps like deduplication yield measurable gains in translation accuracy.
Parameter-efficient methods such as LoRA enable effective adaptation of large multilingual models to new language pairs using the released corpus.
Releasing linguistically annotated parallel data supports reproducible experiments and morphology-aware analysis.
Disciplined preprocessing can substitute for larger data volumes in resource-constrained translation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar deduplication strategies may benefit other language pairs where data is scraped from multiple domains.
The emphasis on corpus hygiene suggests that data-centric approaches could outperform purely model-centric scaling in low-resource scenarios.
The dataset's domain diversity opens opportunities to study cross-domain transfer in machine translation.
Extending the ablation to other preprocessing steps or languages would test the generality of the deduplication finding.

Load-bearing premise

The sentence pairs gathered from various public sources form sufficiently accurate and parallel translations without substantial systematic errors or domain mismatches.

What would settle it

Reproducing the ablation study on the public dataset and observing no performance difference or an increase when deduplication is skipped would falsify the claim that deduplication is the largest contributor.

Figures

Figures reproduced from arXiv: 2605.27050 by Abhi Mehta, Akshita Bhasin, Anushka Yadav, Michael Tiemann, Param Thakkar, Shrinivas Khedkar.

**Figure 2.** Batch-based text processing workflow. Batch execution enables scalable preprocessing while preserving language-specific transformations.

**Figure 3.** Domain distribution of the dataset. The corpus spans multiple domains, with the largest share concentrated …

**Figure 4.** Training curves for LoRA-based fine-tuning.

**Figure 5.** Ablation impact relative to the full pipeline. Bars show …

Original abstract

We present BhashaSetu, a linguistically enriched English-Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

BhashaSetu releases a 2.78M English-Marathi corpus and shows deduplication as the biggest preprocessing gain, but offers no validation of pair quality.

The one thing to know is that this paper releases a 2.78 million sentence-pair English-Marathi corpus from news, politics, healthcare, literature and culture, and reports that skipping corpus-level deduplication drops performance by 1.17 BLEU and 2.21 chrF++.

They add stemmed and lemmatized versions, benchmark several models on BLEU, spBLEU, chrF++ and TER, and run LoRA fine-tuning on NLLB-200-distilled-600M. The ablation isolates preprocessing steps and identifies deduplication as the largest single contributor.

Releasing the full dataset publicly is the useful part. Anyone working on Marathi or other low-resource Indian languages now has a sizable multi-domain resource they can actually use, and the quantified hygiene result is a low-cost finding others can replicate.

The soft spot is the absence of any reported check on the parallel quality of the collected pairs. The sources are heterogeneous public ones, yet the text gives no human validation, alignment scoring, or error sampling. If misalignment or noise levels differ across the sources, the deduplication delta could partly reflect removal of bad pairs rather than duplication per se. No statistical significance or variance numbers are mentioned either.
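
One common way to attach a significance figure to a metric gap like this is paired bootstrap resampling over the test sentences. The sketch below assumes per-sentence scores (for example sentence-level chrF) are available for both systems on the same test set; for a corpus-level metric such as BLEU the metric would have to be recomputed on each resample instead of averaged. The function name and the toy scores are hypothetical and are not taken from the paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A's mean score beats B's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, using the same
        # indices for both systems so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Toy per-sentence scores (made up): A is better on every sentence.
with_dedup = [0.52, 0.61, 0.47, 0.70, 0.58]
without_dedup = [0.50, 0.57, 0.46, 0.66, 0.55]
win_rate = paired_bootstrap(with_dedup, without_dedup)
```

A win rate close to 1.0 suggests the gap is stable under test-set resampling; it says nothing about variance across training runs, which needs multiple seeds.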

This is for people who need Marathi data right now or who study data-centric methods in MT. The empirical preprocessing claim is testable once the data is out, so the paper deserves a serious referee even if the quality controls need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces BhashaSetu, a 2.78M English–Marathi parallel corpus compiled from heterogeneous public sources (news, politics, healthcare, literature, culture) and augmented with stemmed/lemmatized forms. It reports benchmarks of multiple NMT systems under BLEU/spBLEU/chrF++/TER, describes LoRA fine-tuning of NLLB-200-distilled-600M, and presents an ablation showing that corpus-level deduplication is the single largest preprocessing contributor (its removal drops performance by 1.17 BLEU and 2.21 chrF++). The dataset is released publicly.

Significance. If the parallel pairs are verifiably high-quality, the work supplies a sizable, domain-diverse resource for an underrepresented language and supplies concrete evidence that disciplined deduplication yields larger gains than other common preprocessing steps in morphologically rich low-resource settings. Public release of the corpus directly supports reproducibility.

major comments (2)

[Corpus construction section] Corpus construction (the section describing collection of the 2.78M pairs): no human validation, automatic alignment scoring, or error-rate sampling of the final corpus is reported. Because the central ablation treats these pairs as reliable gold parallels, the observed 1.17 BLEU / 2.21 chrF++ drop when deduplication is omitted could be driven by differential noise amplification rather than duplication per se.
[Ablation study section] Ablation study (the section reporting the deduplication result): the experiment removes only deduplication while keeping all other preprocessing fixed, but does not report whether the non-deduplicated corpus was re-balanced for domain or length distribution; without that control it is unclear whether the metric deltas are attributable solely to duplicate removal.
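
The control requested in the second comment can be approximated by comparing the domain (or length) distribution of the corpus before and after deduplication. The sketch below assumes every pair carries a domain label and uses total variation distance as the summary statistic; the helper names and the toy labels are hypothetical.

```python
from collections import Counter


def shares(labels):
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def length_bucket(sentence, width=10):
    """Bucket a sentence by whitespace token count, e.g. '0-9', '10-19'."""
    n = len(sentence.split())
    low = (n // width) * width
    return f"{low}-{low + width - 1}"


# Toy domain labels (made up): duplicates are concentrated in 'news'.
raw_domains = ["news"] * 6 + ["health"] * 2 + ["culture"] * 2
dedup_domains = ["news"] * 4 + ["health"] * 2 + ["culture"] * 2
shift = total_variation(shares(raw_domains), shares(dedup_domains))
```

The same `total_variation` call works on length buckets by passing `shares(length_bucket(s) for s in sentences)`. A non-trivial shift would mean the ablation changes the domain mix as well as the duplicate count.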

minor comments (2)

[Abstract] The abstract states concrete metric deltas but does not indicate whether the reported BLEU/chrF++ figures are averages over multiple random seeds or single runs; adding this detail would strengthen the ablation claim.
[Results section] The table or figure presenting the ablation results should include the full set of preprocessing variants tested, so readers can verify that deduplication indeed ranks as the largest single contributor.
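
For the seed question raised in the first minor comment, a report of mean and standard deviation per configuration is usually enough. The sketch below is a minimal illustration; the `delta_is_clear` rule of thumb, its threshold `k`, and the BLEU values are placeholders and not results from the paper.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation of one metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def delta_is_clear(full, ablated, k=2.0):
    """Crude check: is the mean gap larger than k times the larger seed spread?"""
    mean_full, sd_full = summarize(full)
    mean_abl, sd_abl = summarize(ablated)
    return (mean_full - mean_abl) > k * max(sd_full, sd_abl)


# Placeholder BLEU values for three seeds (not results from the paper).
full_pipeline = [21.0, 21.5, 20.5]
no_dedup = [19.0, 19.5, 18.5]
mean_full, sd_full = summarize(full_pipeline)
clear = delta_is_clear(full_pipeline, no_dedup)
```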

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on corpus validation and experimental controls. We address each major comment below and indicate the changes planned for the revised manuscript.

Referee: [Corpus construction section] Corpus construction (the section describing collection of the 2.78M pairs): no human validation, automatic alignment scoring, or error-rate sampling of the final corpus is reported. Because the central ablation treats these pairs as reliable gold parallels, the observed 1.17 BLEU / 2.21 chrF++ drop when deduplication is omitted could be driven by differential noise amplification rather than duplication per se.

Authors: We acknowledge that the original manuscript did not report human validation, automatic alignment scoring, or error-rate sampling. The corpus was assembled from heterogeneous public sources with the assumption of reasonable quality, but this leaves open the possibility that noise differences contribute to the ablation results. In the revised manuscript we will add automatic alignment scores (e.g., via LASER) for the full corpus and results from manual inspection of a random sample of 500 pairs, including the estimated error rate. We will also expand the discussion to note that some portion of the observed 1.17 BLEU drop could arise from differential noise amplification and that future work should further disentangle these factors. revision: yes
Referee: [Ablation study section] Ablation study (the section reporting the deduplication result): the experiment removes only deduplication while keeping all other preprocessing fixed, but does not report whether the non-deduplicated corpus was re-balanced for domain or length distribution; without that control it is unclear whether the metric deltas are attributable solely to duplicate removal.

Authors: The ablation intentionally kept all other preprocessing steps identical to isolate the effect of deduplication. The non-deduplicated corpus was therefore not re-balanced. In the revision we will report domain and sentence-length distributions for both versions to allow readers to assess whether distributional shifts beyond duplicate removal are present. If imbalances exist we will note them explicitly as a limitation of the current experimental design. revision: partial
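
The manual inspection promised in the first response (a random sample of 500 pairs with an estimated error rate) could be set up roughly as follows. The sketch draws a reproducible sample and reports a Wilson score interval for the observed error proportion; the function names, the seed, the placeholder corpus and the count of 25 bad pairs are all assumptions for illustration.

```python
import math
import random


def sample_for_review(pairs, k=500, seed=13):
    """Draw a reproducible simple random sample of pairs for manual checking."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error proportion."""
    if n == 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Placeholder corpus and a made-up count of 25 bad pairs among 500 inspected.
corpus = [(f"en {i}", f"mr {i}") for i in range(10_000)]
review_set = sample_for_review(corpus)
low, high = wilson_interval(errors=25, n=500)
```

A simple random sample estimates the corpus-wide error rate; if noise is expected to differ by source, sampling separately within each source would show that more directly.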

Circularity Check

0 steps flagged

Empirical ablation on new corpus exhibits no circular derivation

The paper reports an empirical study of dataset curation followed by ablation experiments that measure downstream NMT performance (BLEU, chrF++) with and without corpus-level deduplication. These results are obtained by direct training and evaluation on held-out test data using standard metrics; no equations, fitted parameters, or self-citations are invoked to derive the reported deltas. The central claim therefore rests on observable performance differences rather than any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that automatically collected parallel sentences constitute reliable training and evaluation data and that standard MT metrics reflect meaningful quality differences.

axioms (1)

domain assumption: BLEU, spBLEU, chrF++ and TER are appropriate and sufficient metrics for comparing translation quality in this setting.
Standard practice in NMT but known to correlate imperfectly with human judgments of meaning and fluency.

pith-pipeline@v0.9.1-grok · 5748 in / 1222 out tokens · 59846 ms · 2026-06-29T18:10:04.160398+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

arXiv preprint arXiv:2601.09012

Translategemma technical report. Preprint, arXiv:2601.09012. Jay Gala, Pranjal A. Chitale, A K Raghavan, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar M, Janki Atul Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, and Anoop Kunchukuttan. 2023. Indictrans2: Towards high-quality and accessible ma- c...

work page arXiv 2023

[2]

In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), pages 114–123, Virtual

Findings of the loresmt 2021 shared task on covid and sign language for low-resource languages. In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), pages 114–123, Virtual. Association for Machine Translation in the Americas. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for...

work page arXiv 2021
