pith. machine review for the scientific record.

arxiv: 2507.09205 · v5 · submitted 2025-07-12 · 💻 cs.CL

Recognition: unknown

From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan

classification 💻 cs.CL
keywords: tibetan, models, language, continual, data, dense, large, pre-training

Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.

This paper has not been read by Pith yet.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    IGDS uses sparse autoencoders to find internal task features in LLMs and selects data that maximally activates them, yielding better math reasoning performance than full-dataset fine-tuning with only half the data.

  2. LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language

    cs.CL 2026-05 conditional novelty 4.0

    Qwen2.5-3B was continually pre-trained and then fine-tuned with rsLoRA (rank 256) on Sardinian data, reaching 28.5 BLEU for translation into Sardinian and outperforming full fine-tuning and other LoRA variants.