Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan; Christan Grant; Christopher Driggers-Ellis; Daisy Zhe Wang

arxiv: 2601.03135 · v2 · pith:Q5FOC4DZnew · submitted 2026-01-06 · 💻 cs.CL

Pith reviewed 2026-05-21 15:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords: machine translation, indigenous languages, synthetic data, low-resource NMT, Guarani, Quechua, Aymara, mBART fine-tuning

The pith

Synthetic data from multilingual models raises chrF++ scores for Guarani-Spanish and Quechua-Spanish translation when added to small curated corpora.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether generating extra sentence pairs with a high-capacity multilingual model can compensate for the extreme scarcity of parallel text in indigenous languages of the Americas. It fine-tunes mBART on both the original curated sets and the augmented versions, then measures quality with chrF++ after applying orthographic normalization and noise-aware filtering. Results show steady gains for Guarani and Quechua, while Aymara experiments reveal that the same generic preprocessing steps are insufficient for highly agglutinative morphology. This matters because many indigenous languages have only hundreds or thousands of usable sentence pairs, so any reliable augmentation technique could make usable translation systems feasible where none exist today.

Core claim

Augmenting curated parallel datasets for Guarani-Spanish and Quechua-Spanish with synthetic sentence pairs produced by a high-capacity multilingual translation model yields consistent chrF++ improvements after fine-tuning mBART, while language-specific preprocessing steps reduce corpus artifacts; diagnostic runs on Aymara show that generic preprocessing remains inadequate for languages with extreme agglutination.

What carries the argument

Synthetic sentence-pair generation from a high-capacity multilingual model, followed by language-specific preprocessing (orthographic normalization and noise-aware filtering) before mBART fine-tuning.
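
The preprocessing stage named here can be pictured with a minimal Python sketch. The source does not list the actual normalization rules or filter criteria, so everything below is an assumption made for illustration: the `APOSTROPHES` table, the `normalize`, `keep_pair`, and `preprocess` names, and the `max_ratio` cutoff are all hypothetical.

```python
import unicodedata

# Hypothetical table of apostrophe-like characters mapped to one form.
# The paper's real normalization rules are not given in the source.
APOSTROPHES = {"\u2019": "'", "\u2018": "'", "\u02bc": "'", "`": "'", "\u00b4": "'"}


def normalize(text: str) -> str:
    """Apply Unicode NFC, unify apostrophe variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(APOSTROPHES.get(ch, ch) for ch in text)
    return " ".join(text.split())


def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Return True if a sentence pair passes simple noise checks.

    max_ratio is an illustrative cutoff, not a value from the paper.
    """
    if not src or not tgt:
        return False  # one side is empty
    if src == tgt:
        return False  # likely an untranslated copy
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


def preprocess(pairs):
    """Normalize both sides of every pair, then drop pairs that look noisy."""
    cleaned = [(normalize(s), normalize(t)) for s, t in pairs]
    return [(s, t) for s, t in cleaned if keep_pair(s, t)]
```

Normalizing before filtering means the copy check and the length ratio are computed on the same text the model would later see.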

If this is right

Synthetic augmentation produces measurable chrF++ gains on Guarani-Spanish and Quechua-Spanish pairs.
Standard preprocessing pipelines are insufficient for highly agglutinative languages such as Aymara.
High-capacity multilingual models can supply usable training material for low-resource language pairs when domain overlap exists.
Fine-tuning mBART on the combined curated-plus-synthetic sets is a practical route for data-scarce Americas languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same augmentation strategy could be tested on other low-resource language families facing comparable data shortages.
Language-specific morphological analyzers or normalization rules may be required before synthetic data can help agglutinative cases like Aymara.
Successful application would reduce the barrier to building basic translation tools that support language documentation and everyday use.

Load-bearing premise

The synthetic pairs must be sufficiently accurate and domain-relevant that their addition does not introduce systematic errors or biases that lower performance on the target indigenous languages.

What would settle it

A drop in chrF++ on held-out Guarani or Quechua test sets after the synthetic data is added would show that the generated pairs degrade rather than improve the model.
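
To make the metric behind this test concrete, here is a simplified sentence-level chrF++ written from scratch in Python. It is a sketch under stated assumptions: character n-grams up to order 6 with whitespace removed, word n-grams up to order 2, and beta = 2. It is not the sacreBLEU implementation and should not be used to reproduce shared-task scores; the function names `ngrams` and `chrf_pp` are hypothetical.

```python
from collections import Counter


def ngrams(seq, n):
    """Count the n-grams of a sequence (empty Counter if the sequence is too short)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def chrf_pp(hyp: str, ref: str, char_n: int = 6, word_n: int = 2,
            beta: float = 2.0) -> float:
    """Simplified sentence-level chrF++ score in the range [0, 100]."""
    units = [
        (list(hyp.replace(" ", "")), list(ref.replace(" ", "")), char_n),
        (hyp.split(), ref.split(), word_n),
    ]
    precisions, recalls = [], []
    for h, r, max_n in units:
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            if not h_counts or not r_counts:
                continue  # skip orders longer than either sentence
            overlap = sum((h_counts & r_counts).values())
            precisions.append(overlap / sum(h_counts.values()))
            recalls.append(overlap / sum(r_counts.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because most of the score comes from character n-grams, a hypothesis that gets a stem right but an affix wrong still earns partial credit, which is one reason character-level metrics are favored for agglutinative languages.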

Original abstract

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Synthetic augmentation plus preprocessing lifts scores on Guarani and Quechua but the unverified quality of the generated pairs and weak Aymara results are the main gaps.

The main takeaway is that synthetic data from a large multilingual model, when combined with orthographic normalization and noise filtering, produces consistent chrF++ gains for Guarani-Spanish and Quechua-Spanish. The Aymara experiments serve as a diagnostic that generic preprocessing has limits for highly agglutinative languages.

What is new is the targeted combination for these three language pairs. The authors generate synthetic pairs, mix them with curated data, apply the language-specific steps, and fine-tune mBART. This is not a novel algorithm but a practical recipe tailored to the AmericasNLP setting.

The paper does a good job keeping the claims grounded. It reports improvements where they occur and uses the chrF++ metric that fits the task. The inclusion of the Aymara case shows they are not just chasing positive results.

The soft spot is the missing validation on the synthetic data itself. No quality metrics, human judgments, or controls for how the synthetic pairs match the target domain and morphology are mentioned. In low-resource agglutinative languages, this leaves open the possibility that gains come from volume or from noise that happens to help the metric. The abstract is short on numbers, so the full version needs clear tables and ablations.

This paper is for the community working on machine translation for indigenous languages of the Americas. Someone looking for augmentation techniques to try on similar pairs will find the preprocessing details useful. It has enough substance and honest diagnostics to deserve a serious referee. I would recommend sending it to peer review rather than a desk reject, with the expectation that reviewers will push for synthetic data quality checks.

Referee Report

2 major / 2 minor

Summary. The manuscript describes an empirical approach to low-resource NMT for indigenous languages of the Americas. Curated parallel corpora for Guarani-Spanish and Quechua-Spanish are augmented with synthetic sentence pairs produced by a high-capacity multilingual model; an mBART model is fine-tuned on both the original and augmented data. Language-specific preprocessing (orthographic normalization and noise-aware filtering) is applied. The primary evaluation metric is chrF++. Experiments report consistent chrF++ gains from the synthetic augmentation for the two main pairs, while diagnostic runs on Aymara illustrate shortcomings of generic preprocessing for highly agglutinative languages.

Significance. If the central empirical claim is substantiated with quantitative controls and synthetic-data quality verification, the work would supply a concrete, replicable recipe for data augmentation in extremely low-resource agglutinative settings and would directly support ongoing AmericasNLP shared-task efforts. The Aymara diagnostic is a useful negative result that highlights the need for morphology-aware preprocessing.

major comments (2)

[Abstract and §4 (Experiments)] Directional chrF++ improvements are stated without any numerical deltas, baseline scores, standard deviations, or statistical significance tests. This absence prevents assessment of effect size and reproducibility of the headline result.
[§3.2 (Synthetic Data Generation)] No automatic or human quality metrics are supplied for the generated pairs, nor is there an ablation that isolates the contribution of data volume versus data fidelity. In highly inflected languages, even modest rates of morphological hallucination or domain shift can produce spurious gains; the current design cannot rule this out.
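
One common way to supply the kind of significance test requested in the first major comment is paired bootstrap resampling over the test set. The sketch below is an assumption about how such a test could look, not something the paper reports. It compares means of per-sentence scores, which is a simplification, since corpus-level chrF++ is not the mean of sentence-level scores. The name `paired_bootstrap` and its defaults are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system B's mean beats system A's.

    scores_a and scores_b are per-sentence scores for the same test sentences,
    in the same order (for example, baseline vs. augmented model).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b > mean_a:
            wins += 1
    return wins / n_resamples
```

A win fraction close to 1.0 suggests the augmented system's advantage is stable across resampled test sets; a value near 0.5 suggests the difference may be noise.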

minor comments (2)

[§3.2] Clarify the exact mixing ratios and filtering thresholds used for synthetic data; these appear as free parameters but are not tabulated.
[§3.1] Provide at least one concrete example of an orthographic normalization rule and a noise filter decision for each language pair.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the changes planned for the revised manuscript.

Referee: [Abstract and §4 (Experiments)] Directional chrF++ improvements are stated without any numerical deltas, baseline scores, standard deviations, or statistical significance tests. This absence prevents assessment of effect size and reproducibility of the headline result.

Authors: We agree that the absence of concrete numbers, baselines, standard deviations, and significance tests limits evaluation of the results. In the revised manuscript we will update both the abstract and §4 to report the exact chrF++ scores for the curated-only baseline and the synthetically augmented models on Guarani-Spanish and Quechua-Spanish, include standard deviations where multiple runs were performed, and add statistical significance tests. These additions will be presented in tables and summarized in the text.

Revision: yes

Referee: [§3.2 (Synthetic Data Generation)] No automatic or human quality metrics are supplied for the generated pairs, nor is there an ablation that isolates the contribution of data volume versus data fidelity. In highly inflected languages, even modest rates of morphological hallucination or domain shift can produce spurious gains; the current design cannot rule this out.

Authors: We acknowledge that direct quality verification of the synthetic pairs was not reported. We will add automatic quality metrics (e.g., chrF++ of back-translated synthetic sentences against held-out references and perplexity under a language model) to the revised §3.2. Our primary experimental contrast (curated-only vs. curated+synthetic) already isolates the effect of adding the synthetic data while keeping the original parallel data fixed; we will clarify this design choice and discuss its limitations with respect to volume versus fidelity. We will also expand the discussion of possible morphological hallucination risks, using the Aymara diagnostic results as supporting evidence that generic augmentation can fail in highly agglutinative settings.

Revision: partial
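
The volume-versus-fidelity question raised in this exchange could be probed by adding a third training condition that matches the augmented set in size but contains no synthetic text. The Python sketch below is one hypothetical design, not the authors' protocol: it upsamples curated pairs with replacement to the same number of examples. The name `build_conditions` and the condition labels are assumptions.

```python
import random


def build_conditions(curated, synthetic, seed=0):
    """Build three training sets for a volume-controlled comparison.

    curated and synthetic are lists of (source, target) pairs;
    curated must be non-empty.
    """
    if not curated:
        raise ValueError("curated set must be non-empty")
    rng = random.Random(seed)
    # Volume control: repeat curated pairs until the set is as large as
    # curated + synthetic, without adding any new sentences.
    extra = [rng.choice(curated) for _ in range(len(synthetic))]
    return {
        "curated_only": list(curated),
        "curated_plus_synthetic": list(curated) + list(synthetic),
        "curated_upsampled": list(curated) + extra,
    }
```

If the augmented model beats both the curated-only and the upsampled condition, the gain is harder to attribute to the number of training examples alone. This control does not fully separate volume from fidelity, because upsampling adds no new vocabulary or structures.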

Circularity Check

0 steps flagged

No significant circularity; purely empirical augmentation and evaluation

The work consists of data augmentation experiments using a high-capacity multilingual model to generate synthetic pairs, followed by fine-tuning mBART and evaluation on chrF++ for Guarani-Spanish, Quechua-Spanish, and diagnostic Aymara cases. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear. The chrF++ metric is adopted from prior shared-task literature rather than derived internally. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central claims rest on reported experimental deltas, which remain falsifiable against external benchmarks and do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that synthetic data from a generic multilingual model transfers usefully to these specific low-resource pairs and that chrF++ adequately captures translation quality for agglutinative languages.

free parameters (2)

Synthetic data volume and mixing ratio: The number or proportion of synthetic pairs added to the curated set is chosen experimentally and not derived from first principles.
Noise-filtering thresholds: Exact cutoffs for removing noisy examples are not specified and appear tuned to the datasets.
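
Both free parameters are easier to report and reproduce if they live in one explicit configuration object. The Python sketch below shows one way to do that; the `AugmentConfig` fields, their default values, and the `mix` function are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AugmentConfig:
    """Illustrative container for the two free parameters (values are placeholders)."""
    synth_per_curated: float = 1.0  # synthetic pairs added per curated pair
    max_length_ratio: float = 3.0   # noise-filter cutoff, recorded for reporting
    seed: int = 0


def mix(curated, synthetic, cfg):
    """Sample synthetic pairs at the configured ratio and shuffle them into the curated set."""
    rng = random.Random(cfg.seed)
    k = min(len(synthetic), round(len(curated) * cfg.synth_per_curated))
    mixed = list(curated) + rng.sample(list(synthetic), k)
    rng.shuffle(mixed)
    return mixed
```

Calling `asdict(cfg)` yields a plain dictionary that can be written next to each result, so every reported score can be traced to the ratio, threshold, and seed that produced it.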

axioms (2)

[domain assumption] chrF++ is the appropriate primary automatic metric for these agglutinative language pairs. Invoked because it was used in recent AmericasNLP shared tasks.
[domain assumption] The high-capacity multilingual model produces synthetic pairs whose distribution is close enough to real data to be beneficial. Implicit in the decision to augment with its output.

pith-pipeline@v0.9.0 · 5684 in / 1335 out tokens · 91661 ms · 2026-05-21T15:40:32.116122+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "We augment curated parallel datasets... with synthetic sentence pairs generated using a high-capacity multilingual translation model... evaluate translation quality using chrF++"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.