
arxiv: 2606.30943 · v1 · pith:K7WUW3REnew · submitted 2026-06-29 · 💻 cs.CL

Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

Pith reviewed 2026-07-01 01:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords: Arabic-Russian translation, parallel corpus, scientific translation, LLM fine-tuning, machine translation benchmark, LoRA adaptation, multilingual models, knowledge transfer

The pith

A hybrid Arabic-Russian parallel corpus of scientific and general texts enables fine-tuned models to translate scientific content more accurately than zero-shot baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark consisting of roughly 27,000 sentence pairs drawn from scientific abstracts together with religion, news, and conversation texts. Fine-tuning three multilingual models with LoRA adapters shows that the largest model reaches higher automatic scores on translation quality measures when adapted to this data than when used without adaptation. A sympathetic reader would care because the work targets the practical problem of language barriers that slow the sharing of research findings between Arabic-speaking and Russian-speaking scientific communities.

Core claim

The paper claims that domain-specific fine-tuning on the introduced hybrid parallel corpus produces measurable gains in Arabic-to-Russian and Russian-to-Arabic translation of scientific abstracts, with the Qwen2.5-7B model adapted via QLoRA at rank 8 delivering the highest scores across BLEU, chrF, BERTScore, and COMET, and that few-shot prompting alone does not yield comparable gains.

What carries the argument

The hybrid parallel corpus of about 27,000 sentence pairs compiled from scientific abstracts mixed with general-domain texts, used as training data for LoRA-based adaptation of multilingual language models.
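
As a rough sketch of why low-rank adapters are cheap to train: LoRA freezes a weight matrix of shape d × k and learns an update BA, with B of shape d × r and A of shape r × k, so each adapted matrix adds r(d + k) trainable parameters. The snippet below works this out for the four ranks listed in the abstract. The 4096 × 4096 matrix size is an illustrative assumption, not a dimension taken from the paper or from any of the three models.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d x k weight matrix."""
    return r * (d + k)


# Hypothetical layer size, chosen only for illustration.
D, K = 4096, 4096
full = D * K

for rank in (8, 16, 32, 64):  # ranks listed in the abstract
    added = lora_trainable_params(D, K, rank)
    share = 100 * added / full
    print(f"rank={rank:>2}: {added:>7,} adapter params ({share:.2f}% of the full matrix)")
```

Under this assumption the adapter size grows linearly with the rank, so rank 8 trains one eighth as many adapter weights as rank 64.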

If this is right

  • Domain-specific fine-tuning becomes necessary for usable performance on scientific translation tasks in these languages.
  • The released models and corpus make direct knowledge exchange between the two research communities more feasible.
  • The approach aligns with goals for international research partnerships and innovation infrastructure.
  • Few-shot methods without adaptation are insufficient for this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid-corpus construction method could be tested on other language pairs that lack dedicated scientific translation resources.
  • Extending evaluation from abstracts to complete research articles would reveal whether the gains hold for longer, more technical texts.
  • Integration of the fine-tuned models into existing translation platforms could be measured by adoption rates among bilingual researchers.

Load-bearing premise

The mixture of scientific abstracts with general-domain texts is representative enough of real scientific writing in both languages that gains on automatic metrics will correspond to usable improvements for actual research content.

What would settle it

A side-by-side human judgment study on held-out full-length scientific papers that finds no reliable preference for the fine-tuned outputs over zero-shot outputs would undermine the claim that the reported metric gains reflect practical translation quality.
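
One way such a side-by-side study could be scored is an exact two-sided sign test on the non-tied preferences. The sketch below is an assumption about the analysis, not a protocol from the paper, and the counts in the usage line are hypothetical placeholders rather than study results.

```python
from math import comb


def sign_test_p_value(prefer_finetuned: int, prefer_zero_shot: int) -> float:
    """Exact two-sided sign test p-value; ties must be dropped before calling."""
    n = prefer_finetuned + prefer_zero_shot
    if n == 0:
        return 1.0
    k = min(prefer_finetuned, prefer_zero_shot)
    # Probability of a split at least this lopsided if raters had no preference (p = 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
print(sign_test_p_value(prefer_finetuned=60, prefer_zero_shot=40))
```

A large p-value here would correspond to the "no reliable preference" outcome described above.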

Figures

Figures reproduced from arXiv: 2606.30943 by M. K. Arabov.

Figure 1: Composition of the hybrid training corpus by source (total: 26,878 examples).

3.2 Model Selection. We selected three multilingual large language models representing different architectures and size regimes. mT5-base (580M). The mT5 (Multilingual T5) model [18] is a massively multilingual variant of the T5 architecture, pretrained on a Common Crawl-based dataset covering 101 languages. We selected the base …
Original abstract

Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research results between these communities, which affects international collaboration and the progress of sustainability-related research. We present a benchmark for Arabic--Russian scientific translation. The benchmark includes a hybrid parallel corpus of about 27,000 sentence pairs, compiled from scientific abstracts and general-domain texts (religion, news, conversations). We fine-tune three multilingual language models -- mT5-base (580M parameters), NLLB-200-distilled-1.3B (1.3B), and Qwen2.5-7B-Instruct (7B) -- using LoRA with ranks 8, 16, 32, and 64. The Qwen2.5-7B model with QLoRA (rank 8) yields BLEU 23.15, chrF 43.89, BERTScore 0.906, and COMET 0.758. These are +4.36 BLEU and +0.051 COMET above the zero-shot baseline. Few-shot prompting with three examples does not improve performance, indicating that domain-specific fine-tuning is required. We release the models, the corpus, and the evaluation code. By lowering the language barrier for scientific texts, the work enables knowledge exchange between Arabic-speaking and Russian-speaking researchers. It contributes to sustainable partnerships (UN SDG 17) and innovation infrastructure (SDG 9), aligning with the conference's focus on technology-driven sustainable development.
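
The abstract gives the fine-tuned scores and the deltas but not the zero-shot scores themselves; they follow by subtraction. The short calculation below recovers the implied baselines and the relative gains, using only the figures quoted in the abstract. Rounding may differ slightly from the paper's own tables.

```python
# Figures quoted in the abstract for Qwen2.5-7B with QLoRA (rank 8).
finetuned = {"BLEU": 23.15, "COMET": 0.758}
delta = {"BLEU": 4.36, "COMET": 0.051}


def implied_baseline(metric: str) -> float:
    """Zero-shot score implied by the fine-tuned score minus the reported gain."""
    return round(finetuned[metric] - delta[metric], 3)


def relative_gain_pct(metric: str) -> float:
    """Reported gain as a percentage of the implied zero-shot score."""
    return round(100 * delta[metric] / implied_baseline(metric), 1)


for m in finetuned:
    print(m, implied_baseline(m), f"+{relative_gain_pct(m)}%")
# BLEU 18.79 +23.2%
# COMET 0.707 +7.2%
```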

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a hybrid Arabic–Russian parallel corpus of ~27k sentence pairs compiled from scientific abstracts plus general-domain texts (religion, news, conversations), then benchmarks LoRA/QLoRA fine-tuning of mT5-base, NLLB-200-distilled-1.3B and Qwen2.5-7B-Instruct on Arabic-to-Russian translation. It reports that QLoRA (rank 8) on Qwen2.5-7B reaches BLEU 23.15 / chrF 43.89 / BERTScore 0.906 / COMET 0.758, outperforming the zero-shot baseline by +4.36 BLEU and +0.051 COMET, while few-shot prompting yields no gain; the authors release the corpus, models and evaluation code and position the work as enabling sustainable scientific knowledge transfer aligned with SDGs 9 and 17.

Significance. If the metric gains are shown to hold on a purely scientific test subset, the released resources would constitute a concrete, reproducible contribution to low-resource scientific translation between two major languages of research communication. The explicit release of models, corpus and code is a clear strength that supports follow-on work.

major comments (2)
  1. [Corpus and evaluation setup (Section 3 / 4)] The manuscript states that the corpus is hybrid but supplies neither the fraction of scientific abstracts in the overall collection nor any indication that the test split is restricted to scientific material. No per-domain metric tables or breakdowns are provided. Because the central claim concerns scientific knowledge transfer, the absence of this information leaves open the possibility that reported gains are driven by easier general-domain examples, directly weakening the link between the empirical results and the stated motivation.
  2. [Results section] The paper reports point estimates for BLEU, chrF, BERTScore and COMET but does not report statistical significance, standard deviation across random seeds, or confidence intervals for the +4.36 BLEU / +0.051 COMET deltas. Given that the headline claim rests on these specific improvements, the lack of significance testing is a load-bearing omission.
minor comments (2)
  1. [Abstract] The sentence describing the corpus should explicitly state the approximate proportion of scientific abstracts versus general-domain text so readers can immediately assess domain balance.
  2. [Results tables] Table captions and axis labels in the results tables should indicate whether scores are computed on the full hybrid test set or a scientific-only subset.
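
The per-domain breakdown requested in these comments amounts to grouping segment-level scores by source domain before averaging. A minimal sketch follows; the domain labels mirror those named in the abstract, while the record layout and the score values are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict
from statistics import mean


def per_domain_mean(records):
    """Average a segment-level score within each domain.

    records: iterable of (domain, score) pairs.
    """
    buckets = defaultdict(list)
    for domain, score in records:
        buckets[domain].append(score)
    return {domain: mean(scores) for domain, scores in buckets.items()}


# Hypothetical segment-level scores, for illustration only.
example = [
    ("scientific", 0.70), ("scientific", 0.74),
    ("news", 0.80), ("religion", 0.78), ("conversation", 0.82),
]
for domain, avg in sorted(per_domain_mean(example).items()):
    print(f"{domain:<12} {avg:.3f}")
```

Averaging only applies to metrics that are defined per segment. A corpus-level metric such as BLEU would instead be recomputed on each domain's subset of the test set.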

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that highlight ways to better align the empirical results with the manuscript's focus on scientific translation. We address each major comment below.

Point-by-point responses
  1. Referee: [Corpus and evaluation setup (Section 3 / 4)] The manuscript states that the corpus is hybrid but supplies neither the fraction of scientific abstracts in the overall collection nor any indication that the test split is restricted to scientific material. No per-domain metric tables or breakdowns are provided. Because the central claim concerns scientific knowledge transfer, the absence of this information leaves open the possibility that reported gains are driven by easier general-domain examples, directly weakening the link between the empirical results and the stated motivation.

    Authors: We agree this information is necessary to substantiate the scientific motivation. The revised manuscript will report the exact fraction of scientific abstracts in the full corpus, clarify the test-split composition, and add per-domain metric tables. We will also report separate results on the scientific subset of the test set. revision: yes

  2. Referee: [Results section] The paper reports point estimates for BLEU, chrF, BERTScore and COMET but does not report statistical significance, standard deviation across random seeds, or confidence intervals for the +4.36 BLEU / +0.051 COMET deltas. Given that the headline claim rests on these specific improvements, the lack of significance testing is a load-bearing omission.

    Authors: We concur that statistical rigor is required. The revision will include standard deviations and confidence intervals computed over multiple random seeds, together with significance tests (e.g., bootstrap or paired tests) for the reported deltas. revision: yes
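
The bootstrap test mentioned in this response could take the form of paired bootstrap resampling over test segments. The sketch below is one common variant, written under the assumption that a segment-level score is available for both systems on the same test sentences. The function name, the resample count, and the toy scores are illustrative and not taken from the paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap over segments.

    Returns (mean_delta, ci_low, ci_high, win_rate), where delta = A - B,
    the interval is the 95% percentile interval of the resampled mean deltas,
    and win_rate is the share of resamples in which A beats B on average.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    resampled = []
    for _ in range(n_resamples):
        # Draw n segments with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lo = resampled[int(0.025 * n_resamples)]
    hi = resampled[min(n_resamples - 1, int(0.975 * n_resamples))]
    wins = sum(d > 0 for d in resampled) / n_resamples
    return sum(diffs) / n, lo, hi, wins


# Toy scores, for illustration only.
fine_tuned = [0.72, 0.80, 0.65, 0.77, 0.70, 0.81]
zero_shot = [0.70, 0.74, 0.66, 0.71, 0.69, 0.75]
print(paired_bootstrap(fine_tuned, zero_shot))
```

For a corpus-level metric such as BLEU, the same resampled indices would be used to recompute the metric on each resample rather than averaging per-segment differences. Variation across random training seeds is a separate source of uncertainty that this procedure does not cover.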

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

Full rationale

The paper reports an empirical benchmark: construction of a hybrid parallel corpus (~27k pairs), fine-tuning of three models (mT5, NLLB, Qwen2.5-7B) via LoRA/QLoRA at varying ranks, and evaluation on held-out test data using standard metrics (BLEU, chrF, BERTScore, COMET) against zero-shot and few-shot baselines. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The reported gains (+4.36 BLEU, +0.051 COMET) are direct experimental outputs, not reductions by construction to the training data or prior author work. The hybrid corpus composition and lack of per-domain breakdowns raise validity questions but do not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work applies standard machine translation fine-tuning and evaluation practices without introducing new free parameters, axioms beyond domain conventions, or invented entities.

axioms (1)
  • domain assumption: Automatic metrics such as BLEU and COMET correlate sufficiently with human judgments of translation quality for scientific text.
    Invoked when claiming that the reported score improvements constitute meaningful progress.



Reference graph

Works this paper leans on

18 extracted references · 6 canonical work pages

  [1] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models. In: International Conference on Learning Representations (ICLR) (2022). https://openreview.net/forum?id=nZeVKeeFYf9

  [2] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient Fine-tuning of Quantized LLMs. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  [3] Rei, R., Stewart, C., Farinha, A.C., Lavie, A.: COMET: A Neural Framework for MT Evaluation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.213

  [4] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Philadelphia, USA (2002). https://doi.org/10.3115/1073083.1073135

  [5] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating Text Generation with BERT. In: International Conference on Learning Representations (ICLR) (2020). https://openreview.net/forum?id=SkeHuCVFDr

  [6] Popovic, M.: chrF: Character n-gram F-score for Automatic MT Evaluation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–. Lisbon, Portugal (2015). https://aclanthology.org/W15-3049/

  [7] Alzubaidi, A., Alsuwaidi, S., Boussaha, B.E.A., Al Qadi, L., Alkaabi, O., Alyafeai, M., Alobeidli, H., Hacid, H.: Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps. arXiv:2510.13430 (2025). https://arxiv.org/abs/2510.13430

  [8] Hadj Ameur, M.S., Guessoum, A.: A Survey on Arabic Machine Translation: Progress, Challenges, and Future Directions. Computer Science Review 38, 100307 (2020). https://doi.org/10.1016/j.cosrev.2020.100307

  [9] Al-Matham, R.N., Darwish, K., Al-Rasheed, R., Alshammari, W., Alhoshan, M., Elsayed, T.: BALSAM: A Platform for Benchmarking Arabic Large Language Models. In: Proceedings of the Third Arabic Natural Language Processing Conference (ArabicNLP 2025), pp. 258–277. Suzhou, China (2025). https://aclanthology.org/2025.arabicnlp-1.19/

  [10] ArabicNLPWorld: Arabic-Russian Parallel Corpus. Hugging Face (2026). https://huggingface.co/datasets/ArabicNLPWorld/arabic-russian-parallel-corpus

  [11] ArabicNLPWorld: Arabic-Russian Scientific Translations. Hugging Face (2026). https://huggingface.co/datasets/ArabicNLPWorld/arabic-russian-scientific-translations

  [12] Alrashed, S., Orabona, F.: AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus. arXiv:2512.18834 (2025). https://arxiv.org/abs/2512.18834

  [13] Arabov, M.K., Khaybullina, S.S.: Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir. arXiv:2605.04948 (2026). https://arxiv.org/abs/2605.04948

  [14] Song, Y., Li, L., Lothritz, C., Ezzini, S., Sleem, L., Bissyandé, T.F., Klein, J.: Are Small Language Models the Silver Bullet to Low-Resource Languages Machine Translation? In: Proceedings of the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026), pp. 1–26. Rabat...

  [15] Al-Khalifa, H., Darwish, K., Mubarak, H., Ali, M., Elsayed, T.: Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation. In: Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) (2024). https://acla...

  [16] Arabov, M.K.: TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP. In: Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family, pp. 29–37. Association for Computational Linguistics, Rabat, Morocco (2026). https://doi.org/10.18653/v1/2026.silkroadnlp-1.4

  [17] Kurbonovich, A.M.: Character-Level Transformer for Tajik–Persian Transliteration with a Parallel Lexical Corpus. In: Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pp. 75–83. Association for Computational Linguistics, Rabat, Morocco (2026). https://doi.org/10.18653/v1/2026.abjadnlp-1.10

  [18] Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., Raffel, C.: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498. Association for Computatio...