
arxiv: 2605.04948 · v2 · pith:AWBWG7T2new · submitted 2026-05-06 · 💻 cs.CL

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

Pith reviewed 2026-06-30 23:35 UTC · model grok-4.3

keywords low-resource language adaptation · parameter-efficient fine-tuning · LoRA · QLoRA · Bashkir · agglutinative languages · perplexity evaluation · language model fine-tuning

The pith

QLoRA on Mistral-7B and Phi-2 reaches test perplexity close to that of a fully fine-tuned GPT-2 medium on Bashkir while training over 40 times fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares LoRA and QLoRA against full fine-tuning when adapting several large language models to Bashkir, a low-resource agglutinative Turkic language. It trains each configuration on a 71k-document corpus of 46.9 million tokens and evaluates perplexity on a held-out test set, repeating each run with three random seeds for reliability. QLoRA on Mistral-7B and Phi-2 delivers test perplexities of 3.79 and 3.81, close to the 3.34 achieved by full fine-tuning of GPT-2 medium, while updating more than forty times fewer parameters. The study also records cases where PEFT methods cause sharp quality drops depending on the base model and its tokenizer (for example, DeepSeek-7B with rank 8 reaches a perplexity of 129.55), and notes that the lowest-perplexity model frequently switches to English in generated text while the QLoRA models continue in Bashkir.

Core claim

QLoRA applied to Mistral-7B and Phi-2 produces test perplexities of 3.79 and 3.81 on Bashkir text, numbers close to the 3.34 achieved by full fine-tuning of GPT-2 medium, yet with more than forty times fewer trainable parameters. Results vary sharply with the choice of base model and tokenizer; certain configurations yield perplexities above 100. Qualitative inspection shows that the parameter-efficient models tend to continue prompts in monolingual Bashkir while the lowest-perplexity full model often switches to English.
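The perplexity figures quoted above follow the standard definition: the exponential of the mean negative log-likelihood per token on held-out text. The sketch below shows that arithmetic only. The function name and the input values are illustrative assumptions, not taken from the paper, and the source does not say whether its values are normalized per token, per word, or per byte.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs.

    token_log_probs: one log-probability per held-out token, as a language
    model would assign to the token that actually occurred.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative input (not from the paper): a model that always spreads its
# probability uniformly over 4 candidates has perplexity of about 4.
uniform_over_four = [math.log(0.25)] * 10
print(perplexity(uniform_over_four))
```

Because the average is taken per token, values computed with different tokenizers are not measured on the same scale, which is worth keeping in mind when comparing models as different as GPT-2 medium and Mistral-7B.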

What carries the argument

QLoRA, the quantized low-rank adaptation method that keeps the base model weights frozen in a low-precision quantized form and trains only the low-rank update matrices added on top of them.
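To see why low-rank adapters train so few parameters, it helps to count them for a single weight matrix. In LoRA-style methods a dense update of shape d_out x d_in is replaced by the product of two thin matrices, B (d_out x r) and A (r x d_in). The sketch below is a back-of-the-envelope count under assumed dimensions; the 4096 x 4096 layer shape is hypothetical and the source does not state which layers the paper adapts.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_params(d_in, d_out):
    """Parameters updated when the full (d_out x d_in) weight matrix is trained."""
    return d_in * d_out


# Hypothetical layer shape, chosen only for illustration.
d_in = d_out = 4096
rank = 8

adapter = lora_trainable_params(d_in, d_out, rank)
dense = dense_params(d_in, d_out)
print(adapter, dense, dense // adapter)  # 65536 16777216 256
```

The 256x ratio here is for one assumed layer. The paper's "over 40 times" figure compares an entire adapter set against all weights of GPT-2 medium, so the two ratios are not the same quantity.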

If this is right

  • QLoRA on 7B-scale models supplies a practical compromise between output quality and computational cost when adapting to Bashkir.
  • The success of any PEFT method depends critically on the specific base model and its tokenizer compatibility with the target language.
  • Perplexity scores alone do not guarantee that generated text remains coherent and stays within the target language.
  • Releasing the trained adapters, code, and data will let others verify or extend the comparison to additional models or languages.
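The point that perplexity does not guarantee staying in the target language suggests a cheap complementary check on generated text. Bashkir is written in Cyrillic, so a sudden rise in Latin letters is a rough signal of a switch to English. The sketch below is one possible heuristic, not the paper's qualitative procedure; the function names and the 0.2 threshold are assumptions.

```python
import unicodedata


def script_shares(text):
    """Return (cyrillic_share, latin_share) over the letters in text."""
    cyrillic = latin = 0
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            cyrillic += 1
        elif name.startswith("LATIN"):
            latin += 1
    total = cyrillic + latin
    if total == 0:
        return 0.0, 0.0
    return cyrillic / total, latin / total


def has_switched(text, max_latin_share=0.2):
    """Flag a continuation whose Latin-letter share exceeds the (assumed) threshold."""
    _, latin_share = script_shares(text)
    return latin_share > max_latin_share


print(has_switched("Башҡортостан"))                # False
print(has_switched("Башҡортостан is a republic"))  # True
```

A script check cannot tell Bashkir from Russian or Tatar, which share the Cyrillic script, so it only catches the switch to English described above and would need a proper language identifier for anything finer.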

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same efficiency pattern may appear for other low-resource agglutinative Turkic languages when the identical base models and QLoRA settings are reused.
  • Pre-trained tokenizers from high-resource languages may impose an upper limit on how much any fine-tuning method can capture complex morphology regardless of adapter rank.
  • Testing whether language-specific tokenizers or higher adapter ranks reduce the observed degradation cases would clarify the boundary conditions of the reported results.
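One way to probe the tokenizer questions raised above is to measure fertility, the average number of tokens a tokenizer spends per word. The sketch below takes any `tokenize(word) -> list` callable. The byte-level stand-in is an assumption used only to make the example self-contained: it represents the worst case of a byte-level vocabulary with no learned merges for the language, not the behaviour of any model in the paper.

```python
def fertility(words, tokenize):
    """Average number of tokens per word for a given tokenize(word) -> list."""
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def byte_fallback_tokenize(word):
    """Stand-in tokenizer (assumption): one token per UTF-8 byte, no merges."""
    return list(word.encode("utf-8"))


# Each Cyrillic letter occupies two bytes in UTF-8, so a 12-letter Bashkir
# word costs 24 tokens in this worst case, while a 5-letter ASCII word costs 5.
print(fertility(["Башҡортостан"], byte_fallback_tokenize))  # 24.0
print(fertility(["hello"], byte_fallback_tokenize))         # 5.0
```

Passing a real tokenizer's encode function in place of the stand-in would give the fertility actually faced by each base model, which is the quantity relevant to the degradation cases.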

Load-bearing premise

The 71k-document Bashkir corpus and the tokenizers of the tested base models are representative enough of the language's agglutinative morphology for the observed performance differences to hold.

What would settle it

Retraining the same QLoRA configurations on a larger or morphologically richer Bashkir corpus and finding that perplexity gaps widen substantially beyond the reported margin, or that full fine-tuning regains a decisive advantage, would undermine the claim of comparable quality at far lower cost.

Original abstract

This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents a comparative experimental study of LoRA and QLoRA (versus full fine-tuning) for adapting LLMs including DistilGPT2, GPT-2 variants, Phi-2, Qwen2.5-7B, DeepSeek-7B and Mistral-7B to Bashkir, a low-resource agglutinative Turkic language. Using a 71k-document (46.9M token) corpus and three random seeds per configuration, it reports that full fine-tuning of GPT-2 medium yields the lowest test perplexity (3.34), while QLoRA on Mistral-7B (3.79) and Phi-2 (3.81) reaches comparable perplexity with >40× fewer trainable parameters. The study also includes qualitative generation checks on Bashkir prompts, noting that the lowest-perplexity model frequently switches to English whereas QLoRA outputs remain monolingual Bashkir, and concludes that QLoRA on 7B-scale models offers an effective quality-cost trade-off. The authors commit to releasing data, code and adapters.

Significance. If the experimental findings hold, the work supplies concrete, reproducible evidence on the viability of parameter-efficient methods for low-resource agglutinative languages and underscores that both base-model/tokenizer choice and the choice of evaluation metric (perplexity versus language fidelity) materially affect outcomes. The use of multiple seeds, the qualitative generation analysis, and the planned public release of resources are positive contributions to reproducibility.

major comments (1)
  1. [Abstract] The central claim that QLoRA on Mistral-7B and Phi-2 achieves 'comparable quality' to full fine-tuning of GPT-2 medium rests on the reported perplexity values (3.79 / 3.81 versus 3.34). However, the same abstract states that the lowest-perplexity model 'frequently switched to English' on Bashkir prompts while the QLoRA models produced monolingual Bashkir continuations. This observation calls into question whether perplexity alone is a sufficient proxy for the intended notion of adaptation quality (faithful low-resource language modeling without language switching).
minor comments (1)
  1. [Abstract] The abstract mentions training with three random seeds but does not report variance, statistical significance tests, or the precise train/validation/test splits; adding these details would strengthen the reliability assessment of the perplexity figures.
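The variance reporting requested here amounts to a mean and a sample standard deviation over the three seeds of each configuration. The sketch below shows that summary. The function name is an assumption and the input numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
import statistics


def summarize_seeds(perplexities):
    """Return (mean, sample standard deviation) over per-seed test perplexities."""
    if len(perplexities) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.mean(perplexities)
    std = statistics.stdev(perplexities)  # sample std, divides by n - 1
    return mean, std


# Placeholder values only (NOT from the paper).
mean, std = summarize_seeds([4.0, 5.0, 6.0])
print(f"{mean:.2f} ± {std:.2f}")  # 5.00 ± 1.00
```

With only three seeds the standard deviation is a coarse estimate, so reporting the individual per-seed values alongside it would be more informative than a significance test alone.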

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will make the suggested clarification in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that QLoRA on Mistral-7B and Phi-2 achieves 'comparable quality' to full fine-tuning of GPT-2 medium rests on the reported perplexity values (3.79 / 3.81 versus 3.34). However, the same abstract states that the lowest-perplexity model 'frequently switched to English' on Bashkir prompts while the QLoRA models produced monolingual Bashkir continuations. This observation calls into question whether perplexity alone is a sufficient proxy for the intended notion of adaptation quality (faithful low-resource language modeling without language switching).

    Authors: We agree that the phrasing in the abstract could be tightened. While the manuscript already reports the language-switching observation in the qualitative analysis and concludes that QLoRA offers a quality-cost trade-off, the abstract's use of 'comparable quality' is anchored primarily in perplexity. We will revise the abstract to explicitly note that QLoRA achieves comparable perplexity while additionally preserving monolingual Bashkir output (unlike the lowest-perplexity full fine-tuning run), thereby addressing the referee's point that perplexity alone is an incomplete proxy. This revision will be made without altering the reported numbers or experimental design.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: all results are direct experimental measurements on held-out data

Full rationale

The paper conducts empirical fine-tuning experiments (LoRA, QLoRA, full FT) on a 71k-document Bashkir corpus across multiple base models, reporting perplexity on test sets and qualitative generation behavior. No equations, derivations, or 'predictions' are claimed; the lowest-perplexity result (GPT-2 medium full FT at 3.34) and QLoRA comparisons (Mistral-7B at 3.79, Phi-2 at 3.81) are raw measured values. The note on dependence on base model/tokenizer and the observation that low perplexity does not guarantee monolingual outputs are also direct empirical findings. No self-citations, ansatzes, or fitted inputs renamed as predictions appear in the provided text. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard NLP assumptions about evaluation metrics and data sufficiency rather than new postulates; no free parameters or invented entities are introduced beyond the experimental setup.

axioms (2)
  • domain assumption Perplexity on a held-out test set is a valid proxy for adaptation quality to Bashkir.
    Used to declare which configuration performed best.
  • domain assumption The 71k-document corpus adequately represents Bashkir for both training and evaluation.
    Foundation for all reported numbers.

pith-pipeline@v0.9.1-grok · 5885 in / 1487 out tokens · 35122 ms · 2026-06-30T23:35:40.920108+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

    cs.CL 2026-06 unverdicted novelty 4.0

    A new 27k-sentence Arabic-Russian parallel corpus supports fine-tuned LLM translation benchmarks that improve BLEU by 4.36 and COMET by 0.051 over zero-shot baselines for scientific content.
