From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
Abstract
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in a follow-up release.
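The abstract does not specify the sampling ratios behind the "balanced multilingual continual pre-training" step. As a minimal sketch of what such a Tibetan/Chinese/English mixture could look like in practice (the function name, language codes, and equal weights below are illustrative assumptions, not the paper's recipe):

```python
import random
from typing import Dict, Iterator, List

def mixture_stream(corpora: Dict[str, List[str]],
                   weights: Dict[str, float],
                   seed: int = 0) -> Iterator[str]:
    """Yield documents, choosing the source language in proportion to `weights`."""
    rng = random.Random(seed)
    langs = list(corpora)
    probs = [weights[lang] for lang in langs]
    cursors = {lang: 0 for lang in langs}
    while True:
        lang = rng.choices(langs, weights=probs, k=1)[0]
        docs = corpora[lang]
        yield docs[cursors[lang] % len(docs)]  # cycle each corpus independently
        cursors[lang] += 1

# Hypothetical usage with an equal-weight Tibetan/Chinese/English mixture.
corpora = {
    "bo": ["<tibetan doc 1>", "<tibetan doc 2>"],
    "zh": ["<chinese doc 1>"],
    "en": ["<english doc 1>"],
}
stream = mixture_stream(corpora, weights={"bo": 1 / 3, "zh": 1 / 3, "en": 1 / 3})
batch = [next(stream) for _ in range(8)]
```

Keeping the language proportions explicit like this makes it easy to upweight the low-resource language while retaining enough high-resource data to limit catastrophic forgetting, which is the usual motivation for mixing in Chinese and English during continual pre-training.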
Forward citations
Cited by 2 Pith papers
- From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
  IGDS uses sparse autoencoders to find internal task features in LLMs and selects data that maximally activates them, yielding better math reasoning performance than full-dataset fine-tuning with only half the data.
- LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
  Qwen2.5-3B was continually pre-trained and then fine-tuned with rsLoRA (r=256) on Sardinian data, reaching 28.5 BLEU when translating into Sardinian and outperforming full fine-tuning and other LoRA variants (a hedged configuration sketch follows this list).
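For the rsLoRA r=256 recipe mentioned above, a rough equivalent can be written with Hugging Face PEFT's rank-stabilized LoRA option. The snippet below is a hedged sketch: the target modules and alpha value are assumptions for illustration, not the configuration reported in the LLiMba paper.

```python
# Illustrative rsLoRA (rank-stabilized LoRA) setup with rank 256 via PEFT.
# Hyperparameters and target modules are assumptions, not the paper's values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

lora_cfg = LoraConfig(
    r=256,                      # high rank, matching the r256 recipe
    lora_alpha=256,             # illustrative; rsLoRA scales by alpha / sqrt(r)
    use_rslora=True,            # enable rank-stabilized scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # sanity-check adapter size before training
```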