From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
Abstract
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in a follow-up release.
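The abstract does not specify the sampling ratios behind the "balanced multilingual continual pre-training" step. As a minimal sketch of what such a Tibetan/Chinese/English mixture could look like in practice (the function name, language codes, and equal weights below are illustrative assumptions, not the paper's recipe):

```python
import random
from typing import Dict, Iterator, List

def mixture_stream(corpora: Dict[str, List[str]],
                   weights: Dict[str, float],
                   seed: int = 0) -> Iterator[str]:
    """Yield documents, choosing the source language in proportion to `weights`."""
    rng = random.Random(seed)
    langs = list(corpora)
    probs = [weights[lang] for lang in langs]
    cursors = {lang: 0 for lang in langs}
    while True:
        lang = rng.choices(langs, weights=probs, k=1)[0]
        docs = corpora[lang]
        yield docs[cursors[lang] % len(docs)]  # cycle each corpus independently
        cursors[lang] += 1

# Hypothetical usage with an equal-weight Tibetan/Chinese/English mixture.
corpora = {
    "bo": ["<tibetan doc 1>", "<tibetan doc 2>"],
    "zh": ["<chinese doc 1>"],
    "en": ["<english doc 1>"],
}
stream = mixture_stream(corpora, weights={"bo": 1 / 3, "zh": 1 / 3, "en": 1 / 3})
batch = [next(stream) for _ in range(8)]
```

Keeping the language proportions explicit like this makes it easy to upweight the low-resource language while retaining enough high-resource data to limit catastrophic forgetting, which is the usual motivation for mixing in Chinese and English during continual pre-training.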
Forward citations
Cited by 2 Pith papers
- From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
  IGDS uses sparse autoencoders to find internal task features in LLMs and selects data that maximally activates them, yielding better math reasoning performance than full-dataset fine-tuning with only half the data.
- LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
  Qwen2.5-3B was continually pre-trained and then fine-tuned with rsLoRA (r=256) on Sardinian data, reaching 28.5 BLEU when translating into Sardinian and outperforming full fine-tuning and other LoRA variants (a hedged configuration sketch follows this list).
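For the rsLoRA r=256 recipe mentioned above, a rough equivalent can be written with Hugging Face PEFT's rank-stabilized LoRA option. The snippet below is a hedged sketch: the target modules and alpha value are assumptions for illustration, not the configuration reported in the LLiMba paper.

```python
# Illustrative rsLoRA (rank-stabilized LoRA) setup with rank 256 via PEFT.
# Hyperparameters and target modules are assumptions, not the paper's values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

lora_cfg = LoraConfig(
    r=256,                      # high rank, matching the r256 recipe
    lora_alpha=256,             # illustrative; rsLoRA scales by alpha / sqrt(r)
    use_rslora=True,            # enable rank-stabilized scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # sanity-check adapter size before training
```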