pith. sign in

arxiv: 2503.18288 · v7 · pith:VI2CZS6Qnew · submitted 2025-03-24 · 💻 cs.CL

TFD: A Comprehensive Structured Tibetan Foundation Dataset for Low-Resource Language Processing and Large-Scale Modeling

classification 💻 cs.CL
keywords tibetandatasetlanguagedatalarge-scalellmsmodelingsafety
0
0 comments X
read the original abstract

Large Language Models (LLMs) have achieved remarkable success in high-resource languages, yet progress in Tibetan remains severely constrained. While recent efforts have begun to address pre-training data scarcity for Tibetan, a more fundamental gap persists: no existing resource supports the complete LLM development pipeline, spanning pre-training, instruction tuning, safety alignment, preference optimization, and reasoning supervision. We introduce the Tibetan Foundation Dataset (TFD), the first structured, large-scale, and expert-curated dataset covering all key stages of Tibetan large language modeling. TFD comprises TIBSTC, a unified corpus of over 11 billion tokens with curated sub-datasets for instruction tuning, safety alignment, and preference optimization, and TIBSTC-CoT, the first large-scale Tibetan chain-of-thought dataset. We demonstrate its utility by training the Sun-Shine family of Tibetan LLMs, achieving substantial improvements over strong baselines on understanding, safety, reasoning, and generation benchmarks. These results underscore that advancing low-resource language modeling requires not only scale, but a structurally complete data ecosystem. We release TFD to facilitate reproducible research and the development of robust, culturally aligned Tibetan LLMs. Code and data are available at https://github.com/Vicentvankor/sun-shine.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

    cs.SD 2026-05 unverdicted novelty 7.0

    Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

  2. From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan

    cs.CL 2025-07 unverdicted novelty 4.0

    A 72GB Tibetan corpus enables continual pre-training of Qwen2.5-7B and a 50B-A10B MoE model, with new benchmarks showing outperformance over prior Tibetan models.