NiuTrans.LMT: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs

Bei Li; Dingyang Lin; Jingbo Zhu; Kaiyan Chang; Murun Yang; Peinan Feng; Quan Du; Tong Xiao; Tong Zheng; Yingfeng Luo; Yuxuan Ouyang; Ziqiang Xu

arXiv: 2511.07003 · v2 · submitted 2025-11-10 · cs.CL

Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, Jingbo Zhu

Pith reviewed 2026-05-17 23:41 UTC · model grok-4.3

keywords: multilingual machine translation · directional degeneration · strategic downsampling · parallel multilingual prompting · large language models · open-source MMT · cross-lingual transfer

The pith

Directional degeneration from many-to-one mappings in pivot-centered training data can be mitigated by strategic downsampling and parallel multilingual prompting to produce competitive multilingual translation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies directional degeneration as a failure mode in multilingual supervised fine-tuning on multi-way parallel data reused symmetrically around a pivot language such as English. It attributes the drop in reverse-direction performance to excessive many-to-one mappings that encourage shortcut learning. To counter this, the authors introduce strategic downsampling to balance training distributions and parallel multilingual prompting to strengthen cross-lingual transfer during training and at test time. They then train the NiuTrans.LMT suite of models in four sizes spanning 60 languages and 234 directions. Evaluations show the resulting models are competitive with other open-source multilingual systems, with the 4B variant matching or exceeding substantially larger baselines.

Core claim

Directional degeneration arises when multi-way parallel data are reused symmetrically around a pivot, creating excessive many-to-one mappings that promote shortcut learning and degrade performance on reverse directions; strategic downsampling restores balance in the data distribution while parallel multilingual prompting supplies auxiliary parallel sentences to improve cross-lingual transfer, together enabling the NiuTrans.LMT models to deliver robust quality across 234 directions.
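
The imbalance described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `strategic_downsample`, the `keep_prob` parameter, and the bundle layout (a dict from language code to sentence) are assumptions made for illustration, and the sampling ratio used in the paper is not stated here. The sketch expands each multi-way bundle into directed pairs around a pivot and keeps only a fraction of the X-to-pivot pairs, which are the ones where many sources share a single target.

```python
import random


def strategic_downsample(bundles, pivot, keep_prob, seed=0):
    """Expand multi-way bundles into directed pairs, thinning X->pivot pairs.

    bundles:   list of dicts mapping language code -> sentence (hypothetical layout).
    pivot:     language code of the pivot, e.g. "en".
    keep_prob: hypothetical probability of keeping each X->pivot pair.
    Returns a list of (src_lang, tgt_lang, src_text, tgt_text) tuples.
    """
    rng = random.Random(seed)
    pairs = []
    for bundle in bundles:
        pivot_text = bundle[pivot]
        for lang, text in bundle.items():
            if lang == pivot:
                continue
            # pivot -> X: one source sentence maps to distinct targets, always kept.
            pairs.append((pivot, lang, pivot_text, text))
            # X -> pivot: many sources share one target, so keep only a fraction.
            if rng.random() < keep_prob:
                pairs.append((lang, pivot, text, pivot_text))
    return pairs
```

With `keep_prob=1.0` the function reproduces the symmetric reuse that the paper associates with degeneration, and with `keep_prob=0.0` it drops the reverse direction entirely; intermediate values trade reverse-direction coverage against the number of sources that share each pivot-side target.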

What carries the argument

Strategic Downsampling (SD) to rebalance many-to-one mappings plus Parallel Multilingual Prompting (PMP) to augment instructions with auxiliary parallel sentences, which together counteract directional degeneration in pivot-centered multilingual training.
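
As a rough illustration of the PMP idea, the sketch below builds a translation instruction that optionally carries one auxiliary parallel sentence. The template wording, the function name `build_prompt`, and its parameters are hypothetical; the paper's actual prompt formats (its Figure 3) are not reproduced on this page, so this should be read as one plausible shape rather than the authors' format.

```python
def build_prompt(src_lang, tgt_lang, src_text, aux_lang=None, aux_text=None):
    """Build a translation instruction, optionally with one auxiliary parallel sentence.

    All template strings here are illustrative assumptions, not the paper's prompts.
    """
    lines = [f"Translate the following text from {src_lang} into {tgt_lang}."]
    lines.append(f"{src_lang}: {src_text}")
    if aux_lang is not None and aux_text is not None:
        # PMP-style hint: the same sentence rendered in a third language.
        lines.append(f"{aux_lang} (auxiliary translation): {aux_text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Plain prompt versus a prompt augmented with an English auxiliary sentence.
plain = build_prompt("German", "Chinese", "Guten Morgen")
augmented = build_prompt("German", "Chinese", "Guten Morgen", "English", "Good morning")
```

Because the auxiliary arguments are optional, the same builder covers both the training-time augmentation and the optional test-time enhancement when an auxiliary translation happens to be available.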

If this is right

The 4B LMT model performs on par with or better than substantially larger open-source baselines.
The full LMT suite remains competitive with existing open-source multilingual machine translation systems.
The models successfully cover 60 languages and 234 translation directions at four parameter scales.
Public release of the models and resources directly supports broader access to scalable multilingual translation.
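
The figure of 234 directions is consistent with a simple count, sketched here under an assumption: that "Chinese-English-centric" means every direction has English or Chinese on at least one side, and that both pivots pair with every other language in both directions. The function name and the placeholder language codes are illustrative only.

```python
def count_directions(num_languages, pivots):
    """Count ordered language pairs in which at least one side is a pivot."""
    # Placeholder codes stand in for the non-pivot languages.
    others = [f"lang{i}" for i in range(num_languages - len(pivots))]
    languages = list(pivots) + others
    pivot_set = set(pivots)
    return sum(
        1
        for src in languages
        for tgt in languages
        if src != tgt and (src in pivot_set or tgt in pivot_set)
    )


print(count_directions(60, ["en", "zh"]))  # 234
```

Equivalently, English pairs with 59 other languages in two directions (118), and Chinese adds 58 further languages in two directions (116), since the English-Chinese pair is already counted.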

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same balancing and prompting techniques could be tested on other pivot-centered multilingual tasks such as summarization or retrieval to check for similar degeneration patterns.
If the methods generalize, they may allow mid-sized models to close the gap with much larger ones without proportional increases in compute.
Extending the approach beyond the current Chinese-English centric setup to additional high-resource pivots would test whether the core mechanism is pivot-specific.

Load-bearing premise

Directional degeneration is caused primarily by excessive many-to-one mappings that encourage shortcut learning, and strategic downsampling together with parallel multilingual prompting can reliably reduce it across all 234 directions without creating offsetting quality losses.

What would settle it

An evaluation in which the 4B LMT model shows consistent and substantial underperformance relative to larger baselines specifically on multiple X-to-English reverse directions would indicate that the mitigation methods did not succeed as claimed.

Figures

Figures reproduced from arXiv: 2511.07003 by Bei Li, Dingyang Lin, Jingbo Zhu, Kaiyan Chang, Murun Yang, Peinan Feng, Quan Du, Tong Xiao, Tong Zheng, yingfeng luo, Yuxuan Ouyang, Ziqiang Xu.

**Figure 1.** Top: Performance of base LLMs (orange) on the Belebele benchmark across 108 languages, plotted against their data ratios in the CulturaX (blue). Bottom: Bilingual data volume (million sentence pairs) from the OPUS corpus for 60 languages in our study, covering English-centric (blue) and Chinese-centric (orange) directions. Languages are grouped into high-, medium-, and low-resource tiers.

**Figure 2.** An overview of our methodology for LMT. The pipeline consists of two main stages: a hybrid data …

**Figure 3.** Examples of the three prompt formats for …

**Figure 4.** The impact of the Strategic Downsampling …

**Figure 6.** Analysis of the Parallel Multilingual Prompt …

**Figure 7.** Generality analysis of Directional Degeneration across multiple foundation models. The x-axis represents …

**Figure 8.** The impact of multilingual scale (number of languages) on the Directional Degeneration. The x-axis …

**Figure 9.** Performance improvements brought by Continued Pre-training (CPT). Languages are grouped by resource …

**Figure 10.** COMETKiwi score distributions for bilingual sentence pairs (En-X) are shown as histograms. Vertical …

**Figure 11.** COMETKiwi score distributions for bilingual sentence pairs (Zh-X) are shown as histograms. Vertical …

Original abstract

Large language models have significantly advanced Multilingual Machine Translation (MMT), yet scaling to many languages while keeping quality robust across directions remains challenging. In this paper, we identify a failure mode of multilingual supervised fine-tuning (SFT) on multi-way parallel data: when such data are reused symmetrically around a pivot language (e.g., English), performance on reverse directions (X $\to$ pivot) can drop substantially. We term this phenomenon Directional Degeneration and attribute it to excessive many-to-one mappings, which encourage shortcut learning. We propose Strategic Downsampling (SD), a simple yet effective method to mitigate this degeneration. In addition, we introduce Parallel Multilingual Prompting (PMP), which augments translation instructions with an auxiliary parallel sentence to promote cross-lingual transfer during training and enables optional test-time enhancement when auxiliary translations are available. We further develop **NiuTrans.LMT** (**L**arge-scale **M**ultilingual **T**ranslation, abbreviated as **LMT**), a Chinese-English-centric suite of multilingual translation models spanning four sizes (0.6B/1.7B/4B/8B) and covering 60 languages and 234 directions. Comprehensive evaluations show that LMT is competitive among open-source MMT systems, and that our 4B LMT model performs on par with or better than substantially larger baselines. We release our models and project resources to support inclusive and scalable MMT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper names Directional Degeneration in multilingual SFT and offers two practical mitigations plus released models, but the causal attribution rests on observation rather than a controlled isolation of the mechanism.

The main thing to know is that this work flags a concrete training issue that arises when multi-way parallel data is reused symmetrically around English: performance on the reverse directions drops. The authors label this Directional Degeneration and link it to an excess of many-to-one mappings that push the model toward shortcuts. They respond with Strategic Downsampling to adjust sampling ratios and Parallel Multilingual Prompting to add auxiliary parallel sentences during training, then build and release the LMT family (0.6B to 8B) across 60 languages and 234 directions, claiming that the 4B version stays competitive with much larger open-source baselines.

Referee Report

2 major / 2 minor

Summary. The paper introduces NiuTrans.LMT, a suite of LLM-based multilingual machine translation models (0.6B/1.7B/4B/8B) covering 60 languages and 234 directions. It identifies Directional Degeneration in multilingual SFT on symmetrically reused multi-way parallel data around English, attributes the reverse-direction drops to shortcut learning from excessive many-to-one mappings, and proposes Strategic Downsampling (SD) plus Parallel Multilingual Prompting (PMP) as mitigations. The central empirical claim is that the 4B LMT model is competitive among open-source MMT systems and on par with or better than substantially larger baselines.

Significance. If the reported competitiveness holds after verification and the proposed SD+PMP methods prove robust without hidden trade-offs, the work would supply practical open models and training heuristics that advance inclusive, scalable MMT. The model releases themselves constitute a concrete contribution to reproducibility.

major comments (2)

§4 (Experimental Setup and Results): the claim that the 4B LMT performs on par with or better than substantially larger baselines is load-bearing for the paper's main contribution, yet the manuscript provides neither the full comparison table with per-direction scores, baseline parameter counts, nor any ablation table isolating SD and PMP from other training choices.
§3 (Directional Degeneration and Proposed Methods): the attribution of performance drops to many-to-one mappings and the assertion that SD plus PMP reliably mitigate it across 234 directions rest on observational correlation; no controlled experiment that varies mapping multiplicity while holding total tokens fixed is described, leaving open whether these interventions are causal or merely correlated with other factors.
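
The "mapping multiplicity" that such a control would vary can be made measurable with a small helper. The sketch below is an assumption about how one might quantify it, not a procedure from the paper: for each target sentence it counts how many distinct sources map onto it. The tuple layout `(src_lang, tgt_lang, src_text, tgt_text)` and both function names are hypothetical.

```python
from collections import defaultdict


def target_multiplicity(pairs):
    """Map each (target language, target text) to its number of distinct sources."""
    sources = defaultdict(set)
    for src_lang, tgt_lang, src_text, tgt_text in pairs:
        sources[(tgt_lang, tgt_text)].add((src_lang, src_text))
    return {target: len(srcs) for target, srcs in sources.items()}


def mean_multiplicity(pairs, tgt_lang):
    """Average number of distinct sources per target sentence in one language."""
    counts = [n for (lang, _), n in target_multiplicity(pairs).items() if lang == tgt_lang]
    return sum(counts) / len(counts) if counts else 0.0


# One three-way bundle reused symmetrically around English.
pairs = [
    ("en", "de", "Hello", "Hallo"),
    ("en", "fr", "Hello", "Bonjour"),
    ("de", "en", "Hallo", "Hello"),
    ("fr", "en", "Bonjour", "Hello"),
]
print(mean_multiplicity(pairs, "en"), mean_multiplicity(pairs, "de"))  # 2.0 1.0
```

A controlled study along the referee's lines could hold total training tokens fixed while sweeping this statistic for the pivot language, which would separate the effect of multiplicity from the effect of simply seeing less reverse-direction data.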

minor comments (2)

Abstract: the phrase 'comprehensive evaluations' is used without naming the test sets, metrics, or number of directions actually reported.
Method (notation): the description of PMP does not clarify whether the auxiliary parallel sentence is drawn from the same multi-way bundle or from an external source, which affects reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying our experimental evidence and outlining revisions to strengthen the manuscript.

Referee: §4 (Experimental Setup and Results): the claim that the 4B LMT performs on par with or better than substantially larger baselines is load-bearing for the paper's main contribution, yet the manuscript provides neither the full comparison table with per-direction scores, baseline parameter counts, nor any ablation table isolating SD and PMP from other training choices.

Authors: We agree that expanded result tables would improve transparency and verifiability. In the revised manuscript we will add a full comparison table in §4 (or a dedicated appendix) that reports per-direction scores for the 4B LMT model against all baselines, together with explicit parameter counts for every baseline system. We will also include a dedicated ablation table that isolates SD and PMP by training otherwise identical models with and without each component. These additions will directly support the competitiveness claim and allow readers to assess the individual contributions of the proposed methods.

Revision: yes

Referee: §3 (Directional Degeneration and Proposed Methods): the attribution of performance drops to many-to-one mappings and the assertion that SD plus PMP reliably mitigate it across 234 directions rest on observational correlation; no controlled experiment that varies mapping multiplicity while holding total tokens fixed is described, leaving open whether these interventions are causal or merely correlated with other factors.

Authors: We recognize that a strictly controlled experiment varying mapping multiplicity while exactly fixing total token count would provide stronger causal evidence. Our current support rests on consistent patterns observed across four model scales and all 234 directions, where reverse-direction degradation appears reliably when symmetric multi-way data are used and is alleviated by SD and PMP. In revision we will add a discussion subsection in §3 that explicitly acknowledges this limitation, reports additional token-count statistics under different downsampling ratios, and outlines a feasible protocol for future controlled studies. We maintain that the large-scale empirical gains demonstrated in the paper still establish the practical utility of SD and PMP for inclusive MMT.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical methods and evaluations are self-contained

The paper identifies Directional Degeneration from observational performance drops when reusing multi-way parallel data symmetrically around English, attributes it to many-to-one mappings encouraging shortcut learning, and introduces Strategic Downsampling plus Parallel Multilingual Prompting as mitigations. These are presented as practical training adjustments derived from the observed issue, followed by standard held-out evaluations across 234 directions showing competitive results for the 4B model. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the performance claims to quantities defined by the same test data or to unverified self-referential premises. The derivation chain relies on external benchmarks and empirical reporting rather than definitional loops or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard machine-learning assumptions about data balance and cross-lingual transfer rather than new mathematical axioms or invented physical entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We term this phenomenon Directional Degeneration and attribute it to excessive many-to-one mappings, which encourage shortcut learning. We propose Strategic Downsampling (SD)... Parallel Multilingual Prompting (PMP)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "LMT... covering 60 languages and 234 translation directions... 4B LMT model performs on par with or better than substantially larger baselines"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories
cs.CL 2026-04 unverdicted novelty 7.0

BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.
