
arxiv: 2402.06619 · v1 · pith:EXHXNSIJnew · submitted 2024-02-09 · 💻 cs.CL · cs.AI

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

keywords datasets · language · collection · dataset · languages · bridge · existing · instances

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.



Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

    cs.CL 2026-06 unverdicted novelty 7.0

    BabelJudge introduces a perturbation-based framework to audit LLM judges for position bias, verbosity bias, order inconsistency, and cross-lingual degradation without human preference labels.

  2. Bayesian Model Merging

    cs.LG 2026-05 unverdicted novelty 6.0

    Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines a...

  3. DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

    cs.LG 2026-05 unverdicted novelty 6.0

    DynaMiCS uses short probing runs to build a slope matrix of cross-domain effects and solves a constrained optimization over mixture weights to improve targets while respecting performance bounds on constrained domains.

  4. COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

  5. DataComp-LM: In search of the next generation of training sets for language models

    cs.LG 2024-06 unverdicted novelty 6.0

    DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

  6. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.