LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
Pith reviewed 2026-05-12 02:00 UTC · model grok-4.3
The pith
rsLoRA at rank 256 adapts a 3B model to Sardinian on a single GPU, reaching 28.5 BLEU on English-to-Sardinian translation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After continued pretraining the model reaches a perplexity of 6.76 on held-out Sardinian and beats the base model on every FLORES-200 direction. When five supervised fine-tuning configurations are compared under matched conditions, rsLoRA r256 records the highest BLEU scores into Sardinian (28.5 from English versus 17.3 after pretraining alone and 21.0 with full fine-tuning). The rank ablation shows that adapter capacity matters more than the choice among LoRA variants, that stronger regularization does not help uniformly, and that translation metrics order methods whose outputs differ in ways the metrics do not detect, including script leakage and confident fabrications on unseen content.
What carries the argument
An rsLoRA adapter with rank 256, which updates a small subset of the model's parameters during supervised fine-tuning; the continued-pretraining stage has already exposed the base Qwen2.5-3B model to Sardinian and related Romance text.
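The rank-stabilized variant differs from plain LoRA only in how the low-rank update is scaled, which is why rank can matter more than the variant chosen. A minimal sketch of the scaling rule (the alpha value is illustrative, not reported by the paper):

```python
import math

def lora_scale(alpha: float, r: int, rank_stabilized: bool = False) -> float:
    # Plain LoRA scales the low-rank update BA by alpha / r; rank-stabilized
    # LoRA (rsLoRA) uses alpha / sqrt(r), so the update's magnitude does not
    # shrink as the rank grows. That is what keeps a high rank such as r=256
    # from being scaled into irrelevance.
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

# Illustrative values; alpha is an assumption, not taken from the paper.
alpha, r = 16, 256
print(lora_scale(alpha, r))                        # 0.0625
print(lora_scale(alpha, r, rank_stabilized=True))  # 1.0
# The ratio between the two is sqrt(r), so the gap widens with rank.
```

At rank 256 the plain-LoRA update is scaled down by a factor of sixteen relative to rsLoRA, which is consistent with the paper's finding that adapter capacity at high rank only pays off when the scaling keeps the update usable.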
If this is right
- Higher adapter ranks improve BLEU scores on the target language more than switching between different LoRA formulations.
- Full fine-tuning does not automatically give the best results when adapting a Romance-pretrained base to another Romance language.
- All adaptation methods still produce fabrications on content absent from the training data.
- Perplexity comparisons across scripts require explicit correction for byte-fallback tokenization effects.
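The byte-fallback point can be made concrete: when a tokenizer backs off to raw bytes for a script it does not cover, the same text is split into more tokens, so the same total loss yields a lower per-token perplexity. Normalizing by bytes removes the artifact. A sketch with illustrative numbers (none taken from the paper):

```python
import math

def per_token_ppl(total_nll_nats: float, n_tokens: int) -> float:
    return math.exp(total_nll_nats / n_tokens)

def per_byte_ppl(total_nll_nats: float, n_bytes: int) -> float:
    # Normalizing by bytes instead of tokens makes perplexity comparable
    # across scripts with different tokenizer fertility.
    return math.exp(total_nll_nats / n_bytes)

# Hypothetical numbers: same text, same total loss, but a byte-fallback
# tokenization splits it into three times as many tokens.
total_nll, latin_tokens, fallback_tokens, n_bytes = 400.0, 100, 300, 500

print(per_token_ppl(total_nll, latin_tokens))     # ~54.6
print(per_token_ppl(total_nll, fallback_tokens))  # ~3.79, deflated
print(per_byte_ppl(total_nll, n_bytes))           # ~2.23 either way
```

The deflation is purely an accounting effect: the model is not predicting the non-Latin text any better, the loss is just spread over more units.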
Where Pith is reading between the lines
- The same continued-pretraining-plus-adapter pipeline could be tested on other low-resource Romance varieties that share vocabulary and structure with Sardinian.
- The replay of related Romance text during pretraining appears to limit register blurring, suggesting it could be reused when adapting to additional related languages.
- Because automatic metrics miss categorical differences in output quality, any production use of the resulting model would still require targeted human checks for hallucinations and script consistency.
Load-bearing premise
BLEU scores, perplexity, and the chosen held-out sets are enough to tell whether the adapted model produces fluent, factually accurate Sardinian without new problems such as script leakage or confident hallucinations.
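A toy illustration of why this premise is load-bearing: clipped n-gram precision, the core ingredient of BLEU, penalizes a single leaked-script token only mildly. This is a sketch of the unigram case, not the paper's evaluation code, and the example sentences are invented:

```python
from collections import Counter

def clipped_unigram_precision(hyp: str, ref: str) -> float:
    # One ingredient of BLEU: each hypothesis word is credited only up to
    # the number of times it appears in the reference.
    h, r = Counter(hyp.split()), Counter(ref.split())
    clipped = sum(min(count, r[word]) for word, count in h.items())
    return clipped / max(1, sum(h.values()))

ref   = "sa limba sarda est una limba romanza"
clean = "sa limba sarda est una limba romanza"
leaky = "sa limba sarda est una лимба romanza"  # one Cyrillic token leaked

print(clipped_unigram_precision(clean, ref))  # 1.0
print(clipped_unigram_precision(leaky, ref))  # ~0.857, still high
```

A categorical failure (script leakage) costs the metric only one word's worth of precision, which is why the review treats automatic scores as insufficient on their own.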
What would settle it
Human evaluation of model outputs on a set of prompts outside the training data, scoring each for grammatical fluency in Sardinian, factual accuracy, and presence of script mixing or invented facts.
Original abstract
Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT the model reaches a perplexity of 6.76 on held out Sardinian and outperforms the base across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian, reaching 28.5 BLEU from English against 17.3 after CPT and 21.0 with full fine-tuning. The rank ablation places r128 between LoRA r64 and rsLoRA r256 on BLEU but reveals failure modes invisible to the metric, including leakage across scripts no other variant produces. LoRA r64 retains less factual content from SFT than configurations at higher rank and produces more confident fabrications, though all methods fabricate on content absent from training. DoRA r256 yields the smallest gap between training and evaluation but the worst factual accuracy. The findings indicate that adapter capacity matters more than the choice among LoRA variants for adapting a Romance pretrained base to a low resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics smoothly order configurations whose qualitative behavior differs categorically. Perplexity comparisons across scripts must account for byte fallback tokenization, which deflates the metric for scripts other than Latin.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the adaptation of the Qwen2.5-3B-Instruct model to Sardinian, a low-resource Romance language, using continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB GPU. The approach uses 11.5M tokens of Sardinian data plus replay from related languages. It evaluates five SFT configurations and reports that rsLoRA with rank 256 achieves the highest BLEU scores on FLORES-200 into-Sardinian translations (28.5 from English), outperforming full fine-tuning (21.0) and CPT alone (17.3). The paper concludes that adapter capacity is more important than the specific LoRA variant, while noting limitations of automatic metrics in distinguishing qualitative differences.
Significance. If the findings are robust, this work is significant for demonstrating efficient, consumer-hardware adaptation of LLMs to vanishing languages with minimal data. It provides concrete comparisons of parameter-efficient methods and useful caveats about relying on BLEU/perplexity for low-resource settings, including the need to account for tokenization effects.
Major comments (2)
- [Abstract / SFT evaluation] The central claim that rsLoRA r256 is superior rests on BLEU scores, yet the manuscript documents that translation metrics can mask categorical qualitative differences (e.g., script leakage in rsLoRA r128, fabrications in LoRA r64, poor factual accuracy in DoRA r256). No equivalent analysis of script leakage, hallucination rate, or factual retention is provided for the rsLoRA r256 winner on the held-out sets, leaving the superiority claim dependent on an unverified assumption that higher BLEU corresponds to better overall quality.
- [Evaluation methodology] The experimental results lack error bars, statistical significance tests, or details on whether data splits, prompts, or hyperparameters were selected post-hoc. This is particularly relevant given the small data regime and the reported perplexity of 6.76 after CPT.
Minor comments (1)
- [Abstract] The final sentence on perplexity and byte fallback tokenization is important but could be expanded with a concrete example or reference to the relevant section in the full text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation approach and methodology. We address each major comment below, indicating where revisions will be made.
Point-by-point responses
Referee: [Abstract / SFT evaluation] The central claim that rsLoRA r256 is superior rests on BLEU scores, yet the manuscript documents that translation metrics can mask categorical qualitative differences (e.g., script leakage in rsLoRA r128, fabrications in LoRA r64, poor factual accuracy in DoRA r256). No equivalent analysis of script leakage, hallucination rate, or factual retention is provided for the rsLoRA r256 winner on the held-out sets, leaving the superiority claim dependent on an unverified assumption that higher BLEU corresponds to better overall quality.
Authors: We agree that the superiority claim for rsLoRA r256 would be strengthened by providing a qualitative analysis comparable to the failure modes we already document for the other configurations. The manuscript explicitly notes the limitations of BLEU in distinguishing qualitative differences and gives concrete examples for rsLoRA r128 (script leakage), LoRA r64 (fabrications), and DoRA r256 (factual accuracy issues). We did not include an equivalent examination for the top-performing rsLoRA r256 on held-out data. We will add this analysis in the revised manuscript, including checks for script consistency, hallucination rate, and factual retention on the held-out sets. revision: yes
Referee: [Evaluation methodology] The experimental results lack error bars, statistical significance tests, or details on whether data splits, prompts, or hyperparameters were selected post-hoc. This is particularly relevant given the small data regime and the reported perplexity of 6.76 after CPT.
Authors: We acknowledge that the current manuscript does not report error bars or statistical significance tests. This stems from the single 24 GB GPU constraint and the small data regime (11.5M Sardinian tokens), which made multiple full training runs for variance estimation impractical. Data splits were fixed in advance using an 80/20 train/validation division on the collected corpora, with FLORES-200 serving as an external held-out test set; no post-hoc adjustment occurred on test data. Hyperparameters were chosen based on validation perplexity during CPT and a small development set for SFT. We will add a dedicated subsection in the experimental setup clarifying the split protocol, prompt templates, and hyperparameter selection process to address concerns about post-hoc choices. We will also report error bars from additional runs where feasible. revision: partial
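A fixed-in-advance split of the kind the response describes can be made reproducible by hashing stable example identifiers, so assignment depends only on the example and cannot drift across runs. The function below is a hypothetical sketch (the `sc-doc-{i}` id scheme is invented), not the authors' code:

```python
import hashlib

def assign_split(example_id: str, train_frac: float = 0.8) -> str:
    # Deterministic hash-based assignment: the split depends only on the
    # example's stable id, never on shuffle order or random state, so a
    # re-run cannot silently move validation examples into training.
    digest = int(hashlib.sha256(example_id.encode("utf-8")).hexdigest(), 16)
    return "train" if (digest % 10_000) / 10_000 < train_frac else "valid"

# Hypothetical corpus of 10,000 documents with invented ids.
labels = [assign_split(f"sc-doc-{i}") for i in range(10_000)]
print(labels.count("train") / len(labels))  # close to 0.8
```

Because SHA-256 output is effectively uniform, the realized train fraction concentrates near the target, and the same id always lands in the same split.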
Circularity Check
No circularity: results from direct empirical training and held-out evaluation
Full rationale
The manuscript describes an empirical pipeline of continued pretraining on 11.5M Sardinian tokens followed by SFT under five adapter configurations, with final claims resting on BLEU and perplexity measured on held-out FLORES-200 and internal test sets. No equations, fitted parameters, or self-citations are invoked to derive the reported scores; the BLEU ordering (rsLoRA r256 at 28.5) is obtained by running the training and decoding steps on unseen data. The paper itself flags that automatic metrics miss certain qualitative failures, but this is an explicit limitation statement rather than a hidden reduction of the metric to its own inputs. The derivation chain therefore contains no self-definitional, fitted-input-called-prediction, or self-citation-load-bearing steps.
Axiom & Free-Parameter Ledger
Free parameters (2)
- LoRA rank r
- Replay token ratio
Axioms (2)
- Domain assumption: Continued pretraining on a small related-language corpus improves downstream performance on the target language without catastrophic forgetting.
- Domain assumption: BLEU on FLORES-200 directions is a sufficient proxy for overall translation quality into Sardinian.