
arxiv: 2604.19778 · v1 · submitted 2026-03-28 · 💻 cs.CL

Towards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

Pith reviewed 2026-05-14 21:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords: machine translation, low-resource languages, Kokborok, NLLB, neural machine translation, back-translation, synthetic data, Tibeto-Burman

The pith

Fine-tuning NLLB on a corpus that includes 25k synthetic pairs lifts Kokborok translation BLEU to 17.30 and 38.56 on two held-out test sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds KokborokMT, the first usable neural machine translation system for Kokborok, a Tibeto-Burman language spoken by roughly 1.5 million people in Tripura, India. Prior systems trained only on Bible text never exceeded BLEU 7. The authors add a Kokborok language token to the NLLB-200-distilled-600M model and fine-tune it on a mixed corpus of 9k professional translations, 1.7k Bible sentences, and 25k synthetic English-to-Kokborok pairs created by back-translating Tatoeba sentences with Gemini Flash. The resulting model reaches BLEU 17.30 and 38.56 on held-out tests and receives human adequacy and fluency scores near 3.7 out of 5. The work demonstrates that modest amounts of high-quality synthetic data can make modern multilingual models practical for severely under-resourced languages.

Core claim

The central claim is that introducing a Kokborok language token and fine-tuning NLLB-200-distilled-600M on 36,052 sentence pairs (9,284 professional SMOL translations, 1,769 Bible sentences, and 24,999 Gemini Flash back-translations) produces BLEU scores of 17.30 and 38.56 on held-out test sets while prior Bible-only systems stayed below 7, with human annotators rating adequacy at 3.74/5 and fluency at 3.70/5.
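The corpus figures in the claim can be cross-checked with simple arithmetic. The sketch below uses only the counts reported in the abstract; the dictionary keys are labels chosen here for illustration.

```python
# Corpus sizes as reported in the paper's abstract.
corpus = {
    "SMOL professional": 9_284,
    "Bible (WMT shared task)": 1_769,
    "Gemini Flash synthetic": 24_999,
}

total = sum(corpus.values())
human_only = corpus["SMOL professional"] + corpus["Bible (WMT shared task)"]
synthetic_share = corpus["Gemini Flash synthetic"] / total

print(total)                      # 36052
print(human_only)                 # 11053
print(f"{synthetic_share:.1%}")   # 69.3%
```

The three sources sum to the stated 36,052 pairs, and roughly 69% of the training data is synthetic, which is why the quality of that portion matters so much to the claim.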

What carries the argument

The addition of a new Kokborok language token to NLLB-200 followed by fine-tuning on a multi-source corpus that mixes limited professional data with large-scale synthetic back-translations generated by Gemini Flash.

Load-bearing premise

The 24,999 synthetic back-translated sentence pairs generated by Gemini Flash are of high enough quality that they improve rather than degrade the final model.

What would settle it

Retraining the same NLLB model on only the 11,053 real sentence pairs and checking whether BLEU drops below 17 or human adequacy and fluency scores fall below 3.5.

Figures

Figures reproduced from arXiv: 2604.19778 by Badal Nyalang, Biman Debbarma.

Figure 1. Left: Training and validation loss curves for System 1 (no synthetic data, red) and System 2 (full pipeline, ...

We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3), a Tibeto-Burman language spoken primarily in Tripura, India with approximately 1.5 million speakers. Despite its status as an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community, with prior machine translation attempts limited to systems trained on small Bible-derived corpora achieving BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus comprising 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce as a new language token for Kokborok in the NLLB framework. Our best system achieves BLEU scores of 17.30 and 38.56 on held-out test sets, representing substantial improvements over prior published results. Human evaluation by three annotators yields mean adequacy of 3.74/5 and fluency of 3.70/5, with substantial agreement between trained evaluators.
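The abstract does not say which agreement statistic sits behind "substantial agreement". One common choice for a pair of raters is Cohen's kappa, where values between 0.61 and 0.80 are conventionally labelled "substantial" (Landis and Koch). The following is a minimal sketch under that assumption; the function name and the ratings are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up 1-5 adequacy scores, NOT taken from the paper.
rater_1 = [4, 4, 3, 5, 2, 4, 3, 4]
rater_2 = [4, 3, 3, 5, 2, 4, 4, 4]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

For ordinal 1-5 scales a weighted kappa, which penalises a 4-versus-3 disagreement less than a 5-versus-1 disagreement, is often preferred; the unweighted form is shown only because it is the simplest to verify by hand.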

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents KokborokMT, a neural machine translation system for the low-resource Kokborok language. It fine-tunes the NLLB-200-distilled-600M model on a 36,052-sentence-pair corpus (9,284 SMOL professional translations + 1,769 Bible sentences + 24,999 Gemini Flash synthetic back-translations from Tatoeba English), introduces a new Kokborok language token, and reports BLEU scores of 17.30 and 38.56 on held-out test sets together with human adequacy (3.74/5) and fluency (3.70/5) ratings from three annotators, claiming substantial gains over prior Bible-only baselines below 7 BLEU.

Significance. If the results hold, the work provides a valuable first high-quality MT baseline for Kokborok, a Tibeto-Burman language with 1.5 million speakers that has seen almost no prior NLP attention. It demonstrates a practical recipe for low-resource settings by combining a strong multilingual base model with modest human data and synthetic augmentation. Credit is given for supplying both automatic metrics and human evaluation scores on held-out data.

major comments (2)
  1. [§4] §4 (Experiments and Results): The reported BLEU gains of 17.30 and 38.56 are obtained only after including the 24,999 Gemini Flash synthetic pairs; no ablation is presented that trains the identical NLLB-200-distilled-600M model on the human data alone (9,284 SMOL + 1,769 Bible pairs). This omission is load-bearing for the central claim that the full corpus yields high-quality translation.
  2. [§3.2] §3.2 (Corpus Construction): No quantitative or qualitative assessment of the 24,999 synthetic back-translated sentences is supplied (e.g., round-trip BLEU, human spot-checks on a sample, or error typology). Given that these pairs constitute the majority of the training data and that Kokborok has minimal pre-training exposure, their quality directly affects whether the observed improvements are attributable to the synthetic component or to the base model and human data.
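A round-trip check of the kind suggested in the second comment could look roughly like the sketch below. It translates each synthetic Kokborok sentence back into English and scores the result against the original English source with a simplified corpus-level BLEU (whitespace tokens, one reference, no smoothing). Everything here is an assumption for illustration: `to_english` stands in for whatever Kokborok-to-English system is available, the placeholder strings are not real Kokborok, and a real evaluation would more likely use an established BLEU implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


def round_trip_bleu(english_sources, synthetic_targets, to_english):
    """Score back-translated synthetic targets against their English sources."""
    back = [to_english(t) for t in synthetic_targets]
    return corpus_bleu(back, english_sources)


# Toy stand-in for a Kokborok-to-English system (hypothetical lookup table).
toy_back_translations = {
    "<kokborok sentence 1>": "the cat sat on the mat",
    "<kokborok sentence 2>": "she reads a book each night",
}
sources = ["the cat sat on the mat", "she reads a book every night"]
targets = ["<kokborok sentence 1>", "<kokborok sentence 2>"]
score = round_trip_bleu(sources, targets, toy_back_translations.get)
print(f"round-trip BLEU: {score:.2f}")
```

A low round-trip score does not by itself prove the synthetic Kokborok is poor, since errors can come from the return translation, so this kind of check is best paired with the human spot-checks the referee also mentions.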
minor comments (2)
  1. [Abstract] Abstract: The sizes, domains, and construction procedures for the two held-out test sets are not described, which limits assessment of the reported BLEU and human scores.
  2. [§3.1] §3.1: The exact procedure for inserting the new Kokborok language token into the NLLB vocabulary and tokenizer is stated only at a high level; a short description of any vocabulary extension or embedding initialization would improve reproducibility.
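To make the second minor comment concrete, the sketch below shows one generic way a vocabulary can be extended with a new language token: append the token, add one embedding row, and initialise that row as the mean of some existing rows. This is a toy, pure-Python illustration and not the authors' procedure, which the paper is said to describe only at a high level. The token string `<kokborok>`, the function name, and the tiny embedding table are hypothetical. With Hugging Face Transformers, the analogous steps would typically go through the tokenizer's `add_special_tokens` and the model's `resize_token_embeddings`.

```python
def add_language_token(vocab, embeddings, token, init_from=None):
    """Append `token` to `vocab` and add one embedding row for it.

    The new row is the mean of the rows named in `init_from`,
    or of all existing rows when `init_from` is None.
    Returns the id of the token.
    """
    if token in vocab:
        return vocab[token]  # already present: nothing to add
    donor_ids = [vocab[t] for t in init_from] if init_from else range(len(embeddings))
    donor_rows = [embeddings[i] for i in donor_ids]
    dim = len(embeddings[0])
    new_row = [sum(row[d] for row in donor_rows) / len(donor_rows) for d in range(dim)]
    vocab[token] = len(embeddings)
    embeddings.append(new_row)
    return vocab[token]


# Toy vocabulary and 2-dimensional embedding table (illustrative values only).
vocab = {"eng_Latn": 0, "ben_Beng": 1, "hello": 2}
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

new_id = add_language_token(
    vocab, embeddings, "<kokborok>", init_from=["eng_Latn", "ben_Beng"]
)
print(new_id, embeddings[new_id])
```

Initialising from existing language-token rows rather than from random values is one plausible choice because it starts the new token near embeddings the model already knows how to use. Which initialisation the authors chose is exactly the detail the referee asks them to report.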

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on KokborokMT. The comments highlight important aspects of experimental rigor and data quality that we will address in revision.

  1. Referee: [§4] §4 (Experiments and Results): The reported BLEU gains of 17.30 and 38.56 are obtained only after including the 24,999 Gemini Flash synthetic pairs; no ablation is presented that trains the identical NLLB-200-distilled-600M model on the human data alone (9,284 SMOL + 1,769 Bible pairs). This omission is load-bearing for the central claim that the full corpus yields high-quality translation.

    Authors: We agree that an ablation training on the human data alone (11,053 pairs) is needed to quantify the synthetic data's contribution. We will add this experiment to §4 in the revised manuscript, reporting BLEU scores for the identical NLLB-200-distilled-600M model fine-tuned only on the SMOL and Bible portions. revision: yes

  2. Referee: [§3.2] §3.2 (Corpus Construction): No quantitative or qualitative assessment of the 24,999 synthetic back-translated sentences is supplied (e.g., round-trip BLEU, human spot-checks on a sample, or error typology). Given that these pairs constitute the majority of the training data and that Kokborok has minimal pre-training exposure, their quality directly affects whether the observed improvements are attributable to the synthetic component or to the base model and human data.

    Authors: We acknowledge the need for synthetic data quality assessment. In the revised §3.2 we will add round-trip BLEU evaluation on a held-out Tatoeba subset and a qualitative error analysis of a 100-sentence sample reviewed by native Kokborok speakers. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The manuscript reports empirical BLEU scores obtained by fine-tuning the NLLB-200-distilled-600M model on a fixed mixture of 9,284 SMOL human pairs, 1,769 Bible sentences, and 24,999 Gemini-generated back-translations, then evaluating on held-out test sets. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. The central results follow directly from standard supervised training and evaluation on external data splits; the synthetic-data quality assumption is an empirical claim open to ablation rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on the pre-trained NLLB model transferring knowledge to Kokborok once a new language token is added and on the assumption that synthetic back-translations add net value.

axioms (1)
  • Domain assumption: a pre-trained multilingual model can be adapted to a new language by adding a language token and fine-tuning on parallel data.
    This is the standard transfer assumption invoked when the authors introduce the Kokborok token and fine-tune NLLB.
invented entities (1)
  • Kokborok language token (no independent evidence)
    Purpose: enable the NLLB model to recognize and generate Kokborok text.
    New token added to the vocabulary without external validation beyond the training run.



