LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
Pith reviewed 2026-05-12 02:00 UTC · model grok-4.3
The pith
rsLoRA at rank 256 adapts a 3B model to Sardinian on a single GPU, reaching 28.5 BLEU on English-to-Sardinian translation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After continued pretraining the model reaches a perplexity of 6.76 on held-out Sardinian and beats the base model on every FLORES-200 direction. When five supervised fine-tuning configurations are compared under matched conditions, rsLoRA r256 records the highest BLEU scores into Sardinian (28.5 from English versus 17.3 after pretraining alone and 21.0 with full fine-tuning). The rank ablation shows that adapter capacity matters more than the choice among LoRA variants, that stronger regularization does not help uniformly, and that translation metrics order methods whose outputs differ in ways the metrics do not detect, including script leakage and confident fabrications on unseen content.
What carries the argument
An rsLoRA adapter with rank 256, which updates a small subset of the model's parameters during supervised fine-tuning; the continued-pretraining stage has already exposed the base Qwen2.5-3B model to Sardinian and related Romance text.
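The rank-stabilized variant differs from plain LoRA only in how the low-rank update is scaled, which is why rank can matter more than the variant chosen. A minimal sketch of the scaling rule (the alpha value is illustrative, not reported by the paper):

```python
import math

def lora_scale(alpha: float, r: int, rank_stabilized: bool = False) -> float:
    # Plain LoRA scales the low-rank update BA by alpha / r; rank-stabilized
    # LoRA (rsLoRA) uses alpha / sqrt(r), so the update's magnitude does not
    # shrink as the rank grows. That is what keeps a high rank such as r=256
    # from being scaled into irrelevance.
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

# Illustrative values; alpha is an assumption, not taken from the paper.
alpha, r = 16, 256
print(lora_scale(alpha, r))                        # 0.0625
print(lora_scale(alpha, r, rank_stabilized=True))  # 1.0
# The ratio between the two is sqrt(r), so the gap widens with rank.
```

At rank 256 the plain-LoRA update is scaled down by a factor of sixteen relative to rsLoRA, which is consistent with the paper's finding that adapter capacity at high rank only pays off when the scaling keeps the update usable.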
If this is right
- Higher adapter ranks improve BLEU scores on the target language more than switching between different LoRA formulations.
- Full fine-tuning does not automatically give the best results when adapting a Romance-pretrained base to another Romance language.
- All adaptation methods still produce fabrications on content absent from the training data.
- Perplexity comparisons across scripts require explicit correction for byte-fallback tokenization effects.
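The byte-fallback point can be made concrete: when a tokenizer backs off to raw bytes for a script it does not cover, the same text is split into more tokens, so the same total loss yields a lower per-token perplexity. Normalizing by bytes removes the artifact. A sketch with illustrative numbers (none taken from the paper):

```python
import math

def per_token_ppl(total_nll_nats: float, n_tokens: int) -> float:
    return math.exp(total_nll_nats / n_tokens)

def per_byte_ppl(total_nll_nats: float, n_bytes: int) -> float:
    # Normalizing by bytes instead of tokens makes perplexity comparable
    # across scripts with different tokenizer fertility.
    return math.exp(total_nll_nats / n_bytes)

# Hypothetical numbers: same text, same total loss, but a byte-fallback
# tokenization splits it into three times as many tokens.
total_nll, latin_tokens, fallback_tokens, n_bytes = 400.0, 100, 300, 500

print(per_token_ppl(total_nll, latin_tokens))     # ~54.6
print(per_token_ppl(total_nll, fallback_tokens))  # ~3.79, deflated
print(per_byte_ppl(total_nll, n_bytes))           # ~2.23 either way
```

The deflation is purely an accounting effect: the model is not predicting the non-Latin text any better, the loss is just spread over more units.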
Where Pith is reading between the lines
- The same continued-pretraining-plus-adapter pipeline could be tested on other low-resource Romance varieties that share vocabulary and structure with Sardinian.
- The replay of related Romance text during pretraining appears to limit register blurring, suggesting it could be reused when adapting to additional related languages.
- Because automatic metrics miss categorical differences in output quality, any production use of the resulting model would still require targeted human checks for hallucinations and script consistency.
Load-bearing premise
BLEU scores, perplexity, and the chosen held-out sets are enough to tell whether the adapted model produces fluent, factually accurate Sardinian without new problems such as script leakage or confident hallucinations.
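A toy illustration of why this premise is load-bearing: clipped n-gram precision, the core ingredient of BLEU, penalizes a single leaked-script token only mildly. This is a sketch of the unigram case, not the paper's evaluation code, and the example sentences are invented:

```python
from collections import Counter

def clipped_unigram_precision(hyp: str, ref: str) -> float:
    # One ingredient of BLEU: each hypothesis word is credited only up to
    # the number of times it appears in the reference.
    h, r = Counter(hyp.split()), Counter(ref.split())
    clipped = sum(min(count, r[word]) for word, count in h.items())
    return clipped / max(1, sum(h.values()))

ref   = "sa limba sarda est una limba romanza"
clean = "sa limba sarda est una limba romanza"
leaky = "sa limba sarda est una лимба romanza"  # one Cyrillic token leaked

print(clipped_unigram_precision(clean, ref))  # 1.0
print(clipped_unigram_precision(leaky, ref))  # ~0.857, still high
```

A categorical failure (script leakage) costs the metric only one word's worth of precision, which is why the review treats automatic scores as insufficient on their own.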
What would settle it
Human evaluation of model outputs on a set of prompts outside the training data, scoring each for grammatical fluency in Sardinian, factual accuracy, and presence of script mixing or invented facts.
Original abstract
Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT the model reaches a perplexity of 6.76 on held out Sardinian and outperforms the base across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian, reaching 28.5 BLEU from English against 17.3 after CPT and 21.0 with full fine-tuning. The rank ablation places r128 between LoRA r64 and rsLoRA r256 on BLEU but reveals failure modes invisible to the metric, including leakage across scripts no other variant produces. LoRA r64 retains less factual content from SFT than configurations at higher rank and produces more confident fabrications, though all methods fabricate on content absent from training. DoRA r256 yields the smallest gap between training and evaluation but the worst factual accuracy. The findings indicate that adapter capacity matters more than the choice among LoRA variants for adapting a Romance pretrained base to a low resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics smoothly order configurations whose qualitative behavior differs categorically. Perplexity comparisons across scripts must account for byte fallback tokenization, which deflates the metric for scripts other than Latin.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the adaptation of the Qwen2.5-3B-Instruct model to Sardinian, a low-resource Romance language, using continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB GPU. The approach uses 11.5M tokens of Sardinian data plus replay from related languages. It evaluates five SFT configurations and reports that rsLoRA with rank 256 achieves the highest BLEU scores on FLORES-200 into-Sardinian translations (28.5 from English), outperforming full fine-tuning (21.0) and CPT alone (17.3). The paper concludes that adapter capacity is more important than the specific LoRA variant, while noting limitations of automatic metrics in distinguishing qualitative differences.
Significance. If the findings are robust, this work is significant for demonstrating efficient, consumer-hardware adaptation of LLMs to vanishing languages with minimal data. It provides concrete comparisons of parameter-efficient methods and useful caveats about relying on BLEU/perplexity for low-resource settings, including the need to account for tokenization effects.
Major comments (2)
- [Abstract / SFT evaluation] The central claim that rsLoRA r256 is superior rests on BLEU scores, yet the manuscript documents that translation metrics can mask categorical qualitative differences (e.g., script leakage in rsLoRA r128, fabrications in LoRA r64, poor factual accuracy in DoRA r256). No equivalent analysis of script leakage, hallucination rate, or factual retention is provided for the rsLoRA r256 winner on the held-out sets, leaving the superiority claim dependent on an unverified assumption that higher BLEU corresponds to better overall quality.
- [Evaluation methodology] The experimental results lack error bars, statistical significance tests, or details on whether data splits, prompts, or hyperparameters were selected post-hoc. This is particularly relevant given the small data regime and the reported perplexity of 6.76 after CPT.
Minor comments (1)
- [Abstract] The final sentence on perplexity and byte fallback tokenization is important but could be expanded with a concrete example or reference to the relevant section in the full text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation approach and methodology. We address each major comment below, indicating where revisions will be made.
Point-by-point responses
Referee: [Abstract / SFT evaluation] The central claim that rsLoRA r256 is superior rests on BLEU scores, yet the manuscript documents that translation metrics can mask categorical qualitative differences (e.g., script leakage in rsLoRA r128, fabrications in LoRA r64, poor factual accuracy in DoRA r256). No equivalent analysis of script leakage, hallucination rate, or factual retention is provided for the rsLoRA r256 winner on the held-out sets, leaving the superiority claim dependent on an unverified assumption that higher BLEU corresponds to better overall quality.
Authors: We agree that the superiority claim for rsLoRA r256 would be strengthened by providing a qualitative analysis comparable to the failure modes we already document for the other configurations. The manuscript explicitly notes the limitations of BLEU in distinguishing qualitative differences and gives concrete examples for rsLoRA r128 (script leakage), LoRA r64 (fabrications), and DoRA r256 (factual accuracy issues). We did not include an equivalent examination for the top-performing rsLoRA r256 on held-out data. We will add this analysis in the revised manuscript, including checks for script consistency, hallucination rate, and factual retention on the held-out sets. revision: yes
Referee: [Evaluation methodology] The experimental results lack error bars, statistical significance tests, or details on whether data splits, prompts, or hyperparameters were selected post-hoc. This is particularly relevant given the small data regime and the reported perplexity of 6.76 after CPT.
Authors: We acknowledge that the current manuscript does not report error bars or statistical significance tests. This stems from the single 24 GB GPU constraint and the small data regime (11.5M Sardinian tokens), which made multiple full training runs for variance estimation impractical. Data splits were fixed in advance using an 80/20 train/validation division on the collected corpora, with FLORES-200 serving as an external held-out test set; no post-hoc adjustment occurred on test data. Hyperparameters were chosen based on validation perplexity during CPT and a small development set for SFT. We will add a dedicated subsection in the experimental setup clarifying the split protocol, prompt templates, and hyperparameter selection process to address concerns about post-hoc choices. We will also report error bars from additional runs where feasible. revision: partial
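A fixed-in-advance split of the kind the response describes can be made reproducible by hashing stable example identifiers, so assignment depends only on the example and cannot drift across runs. The function below is a hypothetical sketch (the `sc-doc-{i}` id scheme is invented), not the authors' code:

```python
import hashlib

def assign_split(example_id: str, train_frac: float = 0.8) -> str:
    # Deterministic hash-based assignment: the split depends only on the
    # example's stable id, never on shuffle order or random state, so a
    # re-run cannot silently move validation examples into training.
    digest = int(hashlib.sha256(example_id.encode("utf-8")).hexdigest(), 16)
    return "train" if (digest % 10_000) / 10_000 < train_frac else "valid"

# Hypothetical corpus of 10,000 documents with invented ids.
labels = [assign_split(f"sc-doc-{i}") for i in range(10_000)]
print(labels.count("train") / len(labels))  # close to 0.8
```

Because SHA-256 output is effectively uniform, the realized train fraction concentrates near the target, and the same id always lands in the same split.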
Circularity Check
No circularity: results from direct empirical training and held-out evaluation
Full rationale
The manuscript describes an empirical pipeline of continued pretraining on 11.5M Sardinian tokens followed by SFT under five adapter configurations, with final claims resting on BLEU and perplexity measured on held-out FLORES-200 and internal test sets. No equations, fitted parameters, or self-citations are invoked to derive the reported scores; the BLEU ordering (rsLoRA r256 at 28.5) is obtained by running the training and decoding steps on unseen data. The paper itself flags that automatic metrics miss certain qualitative failures, but this is an explicit limitation statement rather than a hidden reduction of the metric to its own inputs. The derivation chain therefore contains no self-definitional, fitted-input-called-prediction, or self-citation-load-bearing steps.
Axiom & Free-Parameter Ledger
Free parameters (2)
- LoRA rank r
- Replay token ratio
Axioms (2)
- Domain assumption: Continued pretraining on a small related-language corpus improves downstream performance on the target language without catastrophic forgetting.
- Domain assumption: BLEU on FLORES-200 directions is a sufficient proxy for overall translation quality into Sardinian.