Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir
Pith reviewed 2026-06-30 23:35 UTC · model grok-4.3
The pith
QLoRA on Mistral-7B and Phi-2 reaches test perplexity on Bashkir close to that of full fine-tuning of GPT-2 medium, with over 40 times fewer trainable parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QLoRA applied to Mistral-7B and Phi-2 produces test perplexities of 3.79 and 3.81 on Bashkir text, numbers close to the 3.34 achieved by full fine-tuning of GPT-2 medium, yet with more than forty times fewer trainable parameters. Results vary sharply with the choice of base model and tokenizer; certain configurations yield perplexities above 100. Qualitative inspection shows that the parameter-efficient models tend to continue prompts in monolingual Bashkir while the lowest-perplexity full model often switches to English.
What carries the argument
QLoRA, the quantized low-rank adaptation method that keeps the base model weights frozen in 4-bit quantized form and trains only the low-rank adapter matrices added on top of them.
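To make the low-rank part of the mechanism concrete, here is a minimal sketch in plain Python. It assumes a single weight matrix W of shape d_out × d_in and a rank-r update B·A scaled by alpha / r, as in the LoRA formulation. The function names and the 4096 × 4096, rank-8 dimensions are illustrative assumptions, not values taken from the paper, and the 4-bit quantization of the frozen weights is left out entirely.

```python
def lora_param_count(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # full fine-tuning updates every entry of W
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x); W itself is never updated."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative dimensions only (hypothetical, not from the paper).
full, lora = lora_param_count(4096, 4096, 8)
ratio = full / lora  # 16777216 / 65536 = 256.0 for this single matrix
```

The per-matrix ratio here is not the model-level ratio quoted in the paper, which compares trainable parameters across different base models. In LoRA, B is initialized to zero, so the adapted layer starts out identical to the frozen one.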
If this is right
- QLoRA on 7B-scale models supplies a practical compromise between output quality and computational cost when adapting to Bashkir.
- The success of any PEFT method depends critically on the specific base model and its tokenizer compatibility with the target language.
- Perplexity scores alone do not guarantee that generated text remains coherent and stays within the target language.
- Releasing the trained adapters, code, and data will let others verify or extend the comparison to additional models or languages.
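The point about perplexity not guaranteeing on-language output suggests a simple automatic check alongside manual inspection. The sketch below is one possible heuristic, not the paper's method: it measures what share of alphabetic characters in a generated continuation are Cyrillic versus Latin, using Unicode character names. The function names and the 0.5 threshold are assumptions. It detects script only, so it cannot tell Bashkir from Russian or Tatar.

```python
import unicodedata


def script_shares(text):
    """Return the share of alphabetic characters that are Cyrillic, Latin, or other."""
    counts = {"CYRILLIC": 0, "LATIN": 0, "OTHER": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["CYRILLIC"] += 1
        elif name.startswith("LATIN"):
            counts["LATIN"] += 1
        else:
            counts["OTHER"] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def switched_to_latin(text, threshold=0.5):
    """Flag a continuation whose letters are mostly Latin (assumed threshold)."""
    return script_shares(text)["LATIN"] > threshold
```

Running `switched_to_latin` over every generated continuation would give a rough language-switching rate per model, which could be reported next to perplexity.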
Where Pith is reading between the lines
- The same efficiency pattern may appear for other low-resource agglutinative Turkic languages when the identical base models and QLoRA settings are reused.
- Pre-trained tokenizers from high-resource languages may impose an upper limit on how much any fine-tuning method can capture complex morphology regardless of adapter rank.
- Testing whether language-specific tokenizers or higher adapter ranks reduce the observed degradation cases would clarify the boundary conditions of the reported results.
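One way to probe the tokenizer question is to measure fertility, the average number of tokens a tokenizer emits per word. The sketch below assumes nothing about the models in the paper: `char_tokenize` and `byte_tokenize` are hypothetical stand-ins for, respectively, a tokenizer with per-character Cyrillic coverage and a byte-level tokenizer with no learned merges for Cyrillic. In practice one would pass a real tokenizer's tokenization function instead, for example the `tokenize` method of a Hugging Face tokenizer.

```python
def fertility(tokenize, text):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(w)) for w in words) / len(words)


def char_tokenize(word):
    # Hypothetical stand-in: one token per character.
    return list(word)


def byte_tokenize(word):
    # Hypothetical stand-in: one token per UTF-8 byte, no merges.
    return list(word.encode("utf-8"))


sample = "Башҡортостан"
char_f = fertility(char_tokenize, sample)   # 12 characters in one word
byte_f = fertility(byte_tokenize, sample)   # each Cyrillic letter is 2 bytes in UTF-8
```

Higher fertility on Bashkir text means longer sequences and more fragmented morphemes for the same sentence, which is one plausible source of the degradation cases the paper reports.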
Load-bearing premise
The 71k-document Bashkir corpus and the tokenizers of the tested base models are representative enough of the language's agglutinative morphology for the observed performance differences to hold.
What would settle it
Retraining the same QLoRA configurations on a larger or morphologically richer Bashkir corpus and finding that perplexity gaps widen substantially beyond the reported margin, or that full fine-tuning regains a decisive advantage, would undermine the claim of comparable quality at far lower cost.
This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a comparative experimental study of LoRA and QLoRA (versus full fine-tuning) for adapting LLMs, including DistilGPT2, GPT-2 variants, Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B, to Bashkir, a low-resource agglutinative Turkic language. Using a 71k-document (46.9M-token) corpus and three random seeds per configuration, it reports that full fine-tuning of GPT-2 medium yields the lowest test perplexity (3.34), while QLoRA on Mistral-7B (3.79) and Phi-2 (3.81) reaches comparable perplexity with >40× fewer trainable parameters. The study also includes qualitative generation checks on Bashkir prompts, noting that the lowest-perplexity fully fine-tuned model often switches to English whereas QLoRA outputs remain monolingual Bashkir, and concludes that QLoRA on 7B-scale models offers an effective quality-cost trade-off. The authors commit to releasing data, code, and adapters.
Significance. If the experimental findings hold, the work supplies concrete, reproducible evidence on the viability of parameter-efficient methods for low-resource agglutinative languages and underscores that both base-model/tokenizer choice and the choice of evaluation metric (perplexity versus language fidelity) materially affect outcomes. The use of multiple seeds, the qualitative generation analysis, and the planned public release of resources are positive contributions to reproducibility.
major comments (1)
- [Abstract] The central claim that QLoRA on Mistral-7B and Phi-2 achieves 'comparable quality' to full fine-tuning of GPT-2 medium rests on the reported perplexity values (3.79 / 3.81 versus 3.34). However, the same abstract states that the lowest-perplexity model 'frequently switched to English' on Bashkir prompts while the QLoRA models produced monolingual Bashkir continuations. This observation directly questions whether perplexity alone is a sufficient proxy for the intended notion of adaptation quality (faithful low-resource language modeling without language switching).
minor comments (1)
- [Abstract] The abstract mentions training with three random seeds but does not report variance, statistical significance tests, or the precise train/validation/test splits; adding these details would strengthen the reliability assessment of the perplexity figures.
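The variance reporting requested in this comment is cheap to produce once per-seed perplexities are available. The sketch below uses Python's `statistics` module to compute the mean and sample standard deviation across seeds. The function name and the three input values are placeholders for illustration and are not results from the paper.

```python
import statistics


def summarize_seeds(perplexities):
    """Return (mean, sample standard deviation) of perplexity across seeds."""
    mean = statistics.mean(perplexities)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(perplexities) if len(perplexities) > 1 else 0.0
    return mean, std


# Placeholder values for illustration only; NOT results from the paper.
example_runs = [4.0, 4.5, 5.0]
mean, std = summarize_seeds(example_runs)
```

With only three seeds the standard deviation is itself a noisy estimate, so reporting the individual per-seed values alongside it would be more informative than the summary alone.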
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below and will make the suggested clarification in the revised manuscript.
- Referee: [Abstract] The central claim that QLoRA on Mistral-7B and Phi-2 achieves 'comparable quality' to full fine-tuning of GPT-2 medium rests on the reported perplexity values (3.79 / 3.81 versus 3.34). However, the same abstract states that the lowest-perplexity model 'frequently switched to English' on Bashkir prompts while the QLoRA models produced monolingual Bashkir continuations. This observation directly questions whether perplexity alone is a sufficient proxy for the intended notion of adaptation quality (faithful low-resource language modeling without language switching).
  Authors: We agree that the phrasing in the abstract could be tightened. While the manuscript already reports the language-switching observation in the qualitative analysis and concludes that QLoRA offers a quality-cost trade-off, the abstract's use of 'comparable quality' is anchored primarily in perplexity. We will revise the abstract to explicitly note that QLoRA achieves comparable perplexity while additionally preserving monolingual Bashkir output (unlike the lowest-perplexity full fine-tuning run), thereby addressing the referee's point that perplexity alone is an incomplete proxy. This revision will be made without altering the reported numbers or experimental design.
  revision: yes
Circularity Check
No circularity: all results are direct experimental measurements on held-out data
The paper conducts empirical fine-tuning experiments (LoRA, QLoRA, full FT) on a 71k-document Bashkir corpus across multiple base models, reporting perplexity on test sets and qualitative generation behavior. No equations, derivations, or 'predictions' are claimed; the lowest-perplexity result (GPT-2 medium full FT at 3.34) and QLoRA comparisons (Mistral-7B at 3.79, Phi-2 at 3.81) are raw measured values. The note on dependence on base model/tokenizer and the observation that low perplexity does not guarantee monolingual outputs are also direct empirical findings. No self-citations, ansatzes, or fitted inputs renamed as predictions appear in the provided text. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Perplexity on a held-out test set is a valid proxy for adaptation quality to Bashkir.
- domain assumption: The 71k-document corpus adequately represents Bashkir for both training and evaluation.
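For reference, the perplexity behind the first assumption is the exponential of the mean per-token negative log-likelihood. The sketch below computes it and also shows a per-character variant that divides the same total by the character count instead. Both function names are illustrative. The provided text does not say which normalization the paper uses, and per-token values depend on how each tokenizer segments the text.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def per_char_perplexity(token_nlls, n_chars):
    """Same total NLL, normalized by character count rather than token count."""
    return math.exp(sum(token_nlls) / n_chars)


# Three tokens, each assigned probability 1/4, covering six characters.
nlls = [math.log(4)] * 3
ppl_token = perplexity(nlls)
ppl_char = per_char_perplexity(nlls, 6)
```

In this toy case the per-token value is about 4 and the per-character value about 2 for the same text and the same total likelihood, which illustrates why the normalization unit matters when models with different tokenizers are compared.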
Forward citations
Cited by 1 Pith paper
- Bridging Scientific Heritage: An Arabic-Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer
  A new 27k-sentence Arabic-Russian parallel corpus supports fine-tuned LLM translation benchmarks that improve BLEU by 4.36 and COMET by 0.051 over zero-shot baselines for scientific content.