pith. machine review for the scientific record.

arxiv: 2605.09015 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.LG

Recognition: no theorem link

LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:00 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords Sardinian · low-resource language adaptation · LoRA · rsLoRA · machine translation · continued pretraining · single-GPU fine-tuning

The pith

rsLoRA rank 256 adapts a 3B model to Sardinian and reaches 28.5 BLEU from English on a single GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn a general 3 billion parameter model into one that handles Sardinian, a Romance language with roughly one million speakers and almost no presence in current NLP systems. It does so through continued pretraining on 11.5 million Sardinian tokens plus related Romance text, followed by supervised fine-tuning, all on a single 24 GB consumer GPU. Among the tested adaptation methods, rsLoRA at rank 256 produces the strongest results on translation into Sardinian and beats both the post-pretraining baseline and full fine-tuning. The work also finds that automatic scores like BLEU and perplexity rank the methods cleanly but miss clear qualitative differences, such as script leakage or factual fabrications that appear in some variants but not others. Perplexity itself must be interpreted carefully when scripts differ because of how tokenizers handle byte fallback.
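To make the recipe concrete, here is a minimal sketch of how such a two-stage run could be wired with the Hugging Face transformers, datasets, and peft stack. The model name matches the paper's base; the corpus file, sequence length, and training hyperparameters are illustrative placeholders, not the authors' settings.

```python
# Minimal two-stage sketch: continued pretraining (CPT) on raw Sardinian text,
# then adapter-based SFT starting from the CPT checkpoint. Assumes the Hugging
# Face stack; the corpus file and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()            # helps a 3B model fit in 24 GB

# Stage 1: CPT on Sardinian plus the small Romance "replay" mix.
raw = load_dataset("text", data_files={"train": "sardinian_plus_replay.txt"})["train"]
tokenized = raw.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=32, num_train_epochs=1,
                           bf16=True, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("llimba-cpt")              # Stage 2 (adapter SFT) resumes from here
```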

Core claim

After continued pretraining the model reaches a perplexity of 6.76 on held-out Sardinian and beats the base model on every FLORES-200 direction. When five supervised fine-tuning configurations are compared under matched conditions, rsLoRA r256 records the highest BLEU scores into Sardinian (28.5 from English versus 17.3 after pretraining alone and 21.0 with full fine-tuning). The rank ablation shows that adapter capacity matters more than the choice among LoRA variants, that stronger regularization does not help uniformly, and that translation metrics order methods whose outputs differ in ways the metrics do not detect, including script leakage and confident fabrications on unseen content.
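For orientation, the kind of scoring behind these BLEU figures can be sketched with sacrebleu. The file names below, including the FLORES-200 Sardinian reference file, are placeholders; nothing here reproduces the paper's decoding setup.

```python
# Scoring sketch for one FLORES-200 direction (English -> Sardinian).
# Assumes detokenized model outputs and reference sentences are already in
# plain-text files, one sentence per line; file names are placeholders.
import sacrebleu

hypotheses = [l.strip() for l in open("outputs.eng-srd.txt", encoding="utf-8")]
references = [l.strip() for l in open("flores200.devtest.srd_Latn.txt", encoding="utf-8")]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])   # one reference stream
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}   chrF = {chrf.score:.1f}")
```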

What carries the argument

An rsLoRA adapter with rank 256, which updates a small subset of the model's parameters during supervised fine-tuning, after a continued-pretraining stage has already exposed the base Qwen2.5-3B-Instruct model to Sardinian and related Romance text.
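A hedged sketch of what that adapter looks like in code, assuming a recent peft release with rank-stabilized LoRA support. Only the rank (256) and the rsLoRA scaling come from the paper; target modules, alpha, and dropout are illustrative choices.

```python
# rsLoRA rank-256 adapter on top of the CPT checkpoint. Only r=256 and the
# rank-stabilized scaling are taken from the paper; target modules, alpha,
# and dropout below are illustrative.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llimba-cpt", torch_dtype=torch.bfloat16)

config = LoraConfig(
    r=256,
    lora_alpha=256,
    use_rslora=True,                  # scale updates by alpha / sqrt(r) instead of alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()    # only the adapter weights train during SFT
```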

If this is right

  • Higher adapter ranks improve BLEU scores on the target language more than switching between different LoRA formulations.
  • Full fine-tuning does not automatically give the best results when adapting a Romance-pretrained base to another Romance language.
  • All adaptation methods still produce fabrications on content absent from the training data.
  • Perplexity comparisons across scripts require explicit correction for byte-fallback tokenization effects (see the sketch below).
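On the last point, a sketch of one common correction: normalize the summed log-likelihood by UTF-8 bytes rather than by tokens, so that a tokenizer that byte-falls-back on one script does not deflate its perplexity relative to another. This is a standard normalization, not necessarily the paper's exact procedure.

```python
# Byte-normalized perplexity: divide the total negative log-likelihood by the
# number of UTF-8 bytes instead of the number of tokens, so scripts the
# tokenizer handles via byte fallback are not unfairly flattered. Model and
# text are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct",
                                             torch_dtype=torch.bfloat16).eval()

def token_and_byte_perplexity(text: str):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        mean_nll = model(ids, labels=ids).loss.item()   # mean NLL per predicted token (nats)
    n_predicted = ids.shape[1] - 1                      # labels are shifted by one position
    total_nll = mean_nll * n_predicted
    n_bytes = len(text.encode("utf-8"))
    return math.exp(mean_nll), math.exp(total_nll / n_bytes)
```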

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same continued-pretraining-plus-adapter pipeline could be tested on other low-resource Romance varieties that share vocabulary and structure with Sardinian.
  • The replay of related Romance text during pretraining appears to limit register blurring, suggesting it could be reused when adapting to additional related languages.
  • Because automatic metrics miss categorical differences in output quality, any production use of the resulting model would still require targeted human checks for hallucinations and script consistency.

Load-bearing premise

BLEU scores, perplexity, and the chosen held-out sets are enough to tell whether the adapted model produces fluent, factually accurate Sardinian without new problems such as script leakage or confident hallucinations.

What would settle it

Human evaluation of model outputs on a set of prompts outside the training data, scoring each for grammatical fluency in Sardinian, factual accuracy, and presence of script mixing or invented facts.
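Human judgment is the decisive test, but a cheap automated pre-screen for one of these failure modes, script mixing, is easy to run first. The sketch below only flags letters from non-Latin scripts and is an editorial illustration, not part of the paper's evaluation.

```python
# Cheap pre-screen for script leakage: flag letters whose Unicode names do not
# mention LATIN in text that should be Sardinian. An editorial illustration,
# not the paper's protocol; the example outputs are invented.
import unicodedata

def non_latin_letters(text: str) -> list[str]:
    return [ch for ch in text
            if ch.isalpha() and "LATIN" not in unicodedata.name(ch, "")]

outputs = ["Sa limba sarda est una limba romanza.",
           "Sa limba sarda est una limba 罗曼 romanza."]   # second line leaks CJK characters
for line in outputs:
    leaked = non_latin_letters(line)
    if leaked:
        print(f"script leakage {leaked!r} in: {line}")
```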

read the original abstract

Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT the model reaches a perplexity of 6.76 on held out Sardinian and outperforms the base across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian, reaching 28.5 BLEU from English against 17.3 after CPT and 21.0 with full fine-tuning. The rank ablation places r128 between LoRA r64 and rsLoRA r256 on BLEU but reveals failure modes invisible to the metric, including leakage across scripts no other variant produces. LoRA r64 retains less factual content from SFT than configurations at higher rank and produces more confident fabrications, though all methods fabricate on content absent from training. DoRA r256 yields the smallest gap between training and evaluation but the worst factual accuracy. The findings indicate that adapter capacity matters more than the choice among LoRA variants for adapting a Romance pretrained base to a low resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics smoothly order configurations whose qualitative behavior differs categorically. Perplexity comparisons across scripts must account for byte fallback tokenization, which deflates the metric for scripts other than Latin.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the adaptation of the Qwen2.5-3B-Instruct model to Sardinian, a low-resource Romance language, using continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB GPU. The approach uses 11.5M tokens of Sardinian data plus replay from related languages. It evaluates five SFT configurations and reports that rsLoRA with rank 256 achieves the highest BLEU scores on FLORES-200 into-Sardinian translations (28.5 from English), outperforming full fine-tuning (21.0) and CPT alone (17.3). The paper concludes that adapter capacity is more important than the specific LoRA variant, while noting limitations of automatic metrics in distinguishing qualitative differences.

Significance. If the findings are robust, this work is significant for demonstrating efficient, consumer-hardware adaptation of LLMs to vanishing languages with minimal data. It provides concrete comparisons of parameter-efficient methods and useful caveats about relying on BLEU/perplexity for low-resource settings, including the need to account for tokenization effects.

major comments (2)
  1. [Abstract / SFT evaluation] The central claim that rsLoRA r256 is superior rests on BLEU scores, yet the manuscript documents that translation metrics can mask categorical qualitative differences (e.g., script leakage in rsLoRA r128, fabrications in LoRA r64, poor factual accuracy in DoRA r256). No equivalent analysis of script leakage, hallucination rate, or factual retention is provided for the rsLoRA r256 winner on the held-out sets, leaving the superiority claim dependent on an unverified assumption that higher BLEU corresponds to better overall quality.
  2. [Evaluation methodology] The experimental results lack error bars, statistical significance tests, or details on whether data splits, prompts, or hyperparameters were selected post-hoc. This is particularly relevant given the small data regime and the reported perplexity of 6.76 after CPT.
minor comments (1)
  1. [Abstract] The final sentence on perplexity and byte fallback tokenization is important but could be expanded with a concrete example or reference to the relevant section in the full text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation approach and methodology. We address each major comment below, indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract / SFT evaluation] The central claim that rsLoRA r256 is superior rests on BLEU scores, yet the manuscript documents that translation metrics can mask categorical qualitative differences (e.g., script leakage in rsLoRA r128, fabrications in LoRA r64, poor factual accuracy in DoRA r256). No equivalent analysis of script leakage, hallucination rate, or factual retention is provided for the rsLoRA r256 winner on the held-out sets, leaving the superiority claim dependent on an unverified assumption that higher BLEU corresponds to better overall quality.

    Authors: We agree that the superiority claim for rsLoRA r256 would be strengthened by providing a qualitative analysis comparable to the failure modes we already document for the other configurations. The manuscript explicitly notes the limitations of BLEU in distinguishing qualitative differences and gives concrete examples for rsLoRA r128 (script leakage), LoRA r64 (fabrications), and DoRA r256 (factual accuracy issues). We did not include an equivalent examination for the top-performing rsLoRA r256 on held-out data. We will add this analysis in the revised manuscript, including checks for script consistency, hallucination rate, and factual retention on the held-out sets. revision: yes

  2. Referee: [Evaluation methodology] The experimental results lack error bars, statistical significance tests, or details on whether data splits, prompts, or hyperparameters were selected post-hoc. This is particularly relevant given the small data regime and the reported perplexity of 6.76 after CPT.

    Authors: We acknowledge that the current manuscript does not report error bars or statistical significance tests. This stems from the single 24 GB GPU constraint and the small data regime (11.5 M Sardinian tokens), which made multiple full training runs for variance estimation impractical. Data splits were fixed in advance using an 80/20 train/validation division on the collected corpora, with FLORES-200 serving as an external held-out test set; no post-hoc adjustment occurred on test data. Hyperparameters were chosen based on validation perplexity during CPT and a small development set for SFT. We will add a dedicated subsection in the experimental setup clarifying the split protocol, prompt templates, and hyperparameter selection process to address concerns about post-hoc choices. We will also include error bars from any feasible additional runs in the revision. revision: partial
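Outside the simulated exchange, one standard way to obtain the uncertainty estimates the referee asks for without retraining is paired bootstrap resampling over test sentences. The sketch below assumes per-sentence outputs from two systems are available; it is a generic technique, not a procedure reported in the paper.

```python
# Paired bootstrap resampling over test sentences: estimates how often system A
# beats system B in corpus BLEU across resampled test sets. Generic significance
# test, not a procedure from the paper; inputs are placeholders.
import random
import sacrebleu

def paired_bootstrap_win_rate(hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]          # resample with replacement
        bleu_a = sacrebleu.corpus_bleu([hyps_a[i] for i in idx],
                                       [[refs[i] for i in idx]]).score
        bleu_b = sacrebleu.corpus_bleu([hyps_b[i] for i in idx],
                                       [[refs[i] for i in idx]]).score
        wins_a += bleu_a > bleu_b
    return wins_a / n_samples

# e.g. win_rate = paired_bootstrap_win_rate(rslora_r256_out, full_ft_out, flores_refs)
```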

Circularity Check

0 steps flagged

No circularity: results from direct empirical training and held-out evaluation

full rationale

The manuscript describes an empirical pipeline of continued pretraining on 11.5M Sardinian tokens followed by SFT under five adapter configurations, with final claims resting on BLEU and perplexity measured on held-out FLORES-200 and internal test sets. No equations, fitted parameters, or self-citations are invoked to derive the reported scores; the BLEU ordering (rsLoRA r256 at 28.5) is obtained by running the training and decoding steps on unseen data. The paper itself flags that automatic metrics miss certain qualitative failures, but this is an explicit limitation statement rather than a hidden reduction of the metric to its own inputs. The derivation chain therefore contains no self-definitional, fitted-input-called-prediction, or self-citation-load-bearing steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard transfer-learning assumptions for related Romance languages and on the premise that the collected 11.5 M tokens plus replay data are representative of Sardinian usage.

free parameters (2)
  • LoRA rank r
    Hyperparameter values (64, 128, 256) selected for the ablation study; the winning configuration is chosen after seeing results.
  • Replay token ratio
    2.4 M tokens of related Romance text added to prevent register blurring; exact proportion is a design choice.
axioms (2)
  • domain assumption: Continued pretraining on a small related-language corpus improves downstream performance on the target language without catastrophic forgetting.
    Invoked to justify the CPT stage before SFT.
  • domain assumption: BLEU on FLORES-200 directions is a sufficient proxy for overall translation quality into Sardinian.
    Used to declare rsLoRA r256 the winner.

pith-pipeline@v0.9.0 · 5682 in / 1586 out tokens · 56889 ms · 2026-05-12T02:00:51.792296+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    Mohammad Baqar and Rajat Khanda. 2025. Hallucinations and Truth: A Comprehensive Accuracy Evaluation of RAG, LoRA and DoRA. arXiv:2502.10497. https://arxiv.org/abs/2502.10497

  2. [2]

    Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John P. Cunningham. 2024. LoRA Learns Less and Forgets Less. arXiv:2405.09673. https://arxiv.org/abs/2405.09673

  3. [3]

    Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, and Heng Ji

  4. [4]

    Lifeng Chen, Ryan Lai, and Tianming Liu. 2025. Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study. arXiv:2512.03976. https://arxiv.org/abs/2512.03976

  5. [5]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems 36. arXiv:2305.14314. https://arxiv.org/abs/2305.14314

  6. [6]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A fra...

  7. [7]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. https://arxiv.org/abs/2106.09685

  8. [8]

    Damjan Kalajdzievski. 2023. A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. arXiv:2312.03732. https://arxiv.org/abs/2312.03732

  9. [9]

    LDJnr. 2023. Capybara: A multi-turn instruction-tuning dataset. https://huggingface.co/datasets/LDJnr/Capybara

  10. [10]

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. 2024. DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv:2402.09353. https://arxiv.org/abs/2402.09353

  11. [11]

    Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, and Jun Shen. 2025. T-LLaMA: a Tibetan large language model based on LLaMA2. Complex & Intelligent Systems, 11(72). DOI: 10.1007/s40747-024-01641-7. https://link.springer.com/article/10.1007/s40747-024-01641-7

  12. [12]

    NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, et al. 2022. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672. https://arxiv.org/abs/2207.04672

  13. [13]

    Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, et al. 2025. Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training. arXiv:2507.09205. https://arxiv.org/abs/2507.09205

  14. [14]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. 2024. Qwen2.5 Technical Report. arXiv:2412.15115. https://arxiv.org/abs/2412.15115