Building Community-Centred NLP Resources for Puno Quechua

Adrian Gamarra Lafuente; Anna Korhonen; Elwin Huaman; Johanna Cordova

arxiv: 2605.28253 · v1 · pith:YFSYU6N3new · submitted 2026-05-27 · 💻 cs.CL · cs.DB · cs.HC

Pith reviewed 2026-06-29 13:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.DB · cs.HC

keywords Puno Quechua · automatic speech recognition · speech corpus · participatory design · low-resource languages · model fine-tuning · Whisper · wav2vec2

The pith

Puno Quechua gains its first dedicated speech corpus and ASR benchmarks from community-collected data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build the first automatic speech recognition (ASR) resources tailored to Puno Quechua by running a participatory design campaign that gathers 66 hours of scripted and spontaneous recordings. Thirty-six of those hours are manually transcribed and validated. The authors then evaluate current models, fine-tune Whisper-base, wav2vec2-base, and XLS-R-300M with and without continued pre-training, and release all data and models openly. A sympathetic reader would care because this supplies concrete digital tools for a language variety that previously had none, created with direct speaker involvement rather than external scraping alone. The work therefore links resource creation to community agency in language technology.

Core claim

The paper presents the first dedicated ASR resources for Puno Quechua: (1) the largest speech corpus for any single Quechua variety, with 66 hours of recordings (36 hours manually transcribed and validated) collected via participatory design; (2) the first systematic benchmark for the variety, which evaluates state-of-the-art models and fine-tunes Whisper-base, wav2vec2-base, and XLS-R-300M with and without continued pre-training; and (3) an open release of all datasets and fine-tuned models.
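Read literally, the core claim implies a small experimental grid: three fine-tuned models, each with and without continued pre-training. The sketch below only enumerates that grid and the corpus arithmetic from the abstract. The dictionary keys and the function name `benchmark_grid` are illustrative assumptions, not names from the paper, and the abstract does not say how the hours without validated transcripts are used.

```python
from itertools import product

# Model names and hour counts are taken from the abstract.
MODELS = ["Whisper-base", "wav2vec2-base", "XLS-R-300M"]
CPT_SETTINGS = [False, True]  # without / with continued pre-training (CPT)

TOTAL_HOURS = 66
TRANSCRIBED_HOURS = 36
# Hours that are not part of the manually transcribed and validated subset.
UNTRANSCRIBED_HOURS = TOTAL_HOURS - TRANSCRIBED_HOURS


def benchmark_grid():
    """Enumerate every (model, CPT) fine-tuning configuration (hypothetical layout)."""
    return [
        {"model": model, "continued_pretraining": cpt}
        for model, cpt in product(MODELS, CPT_SETTINGS)
    ]


if __name__ == "__main__":
    for config in benchmark_grid():
        print(config)
    print("hours without validated transcripts:", UNTRANSCRIBED_HOURS)
```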

What carries the argument

Participatory design campaign that collects, transcribes, and validates speech data, followed by fine-tuning of pre-trained ASR models on the resulting Puno Quechua corpus.

If this is right

Puno Quechua speakers obtain openly available datasets and models that can support building local speech applications.
Subsequent ASR research on Quechua varieties obtains a concrete baseline for comparison.
Community-shaped data collection produces resources aligned with actual scripted and spontaneous usage patterns.
Open release enables other researchers to extend or adapt the resources without starting from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same participatory collection method could be replicated for other Quechua varieties or unrelated low-resource languages.
Continued pre-training on small target-language corpora may prove a reusable tactic for improving ASR accuracy in similar settings.
The released models could serve as starting points for downstream tasks such as spoken language documentation or translation aids.

Load-bearing premise

The participatory design campaign and manual transcription produce data of sufficient quality, diversity, and representativeness to support effective ASR model development and benchmarking.

What would settle it

If fine-tuned models show no measurable word-error-rate improvement over their zero-shot baselines when tested on held-out Puno Quechua audio, the claimed utility of the new corpus and benchmarks would be refuted.
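The test above can be made concrete with a word-error-rate comparison. The following is a minimal sketch, assuming corpus-level WER computed as total word-level edit distance divided by total reference words. The function names and the placeholder token strings are hypothetical, and the paper's own scoring setup (text normalisation, tooling) is not described in the abstract.

```python
from typing import Sequence


def edit_distance(ref_words: Sequence[str], hyp_words: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            curr[j] = min(
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
            )
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total edits over total reference words across all utterances."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must align one-to-one")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_edits += edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


def fine_tuning_helps(references, zero_shot, fine_tuned) -> bool:
    """True if the fine-tuned outputs have strictly lower corpus WER."""
    return corpus_wer(references, fine_tuned) < corpus_wer(references, zero_shot)


if __name__ == "__main__":
    # Placeholder tokens, not real Puno Quechua transcripts.
    refs = ["a b c d", "e f"]
    zero_shot = ["a x c", "e f g"]
    fine_tuned = ["a b c d", "e g"]
    print(corpus_wer(refs, zero_shot), corpus_wer(refs, fine_tuned))
    print(fine_tuning_helps(refs, zero_shot, fine_tuned))
```

A strict inequality on a small held-out set is a weak criterion; a real comparison would likely add a significance test over utterances or speakers before calling a gain measurable.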

Figures

Figures reproduced from arXiv: 2605.28253 by Adrian Gamarra Lafuente, Anna Korhonen, Elwin Huaman, Johanna Cordova.

**Figure 1.** Puno Quechua ASR Pipeline.

… not formally identified, or they are aggregated by linguistic group (e.g. Southern Quechua (Cardenas et al., 2018) or Collao Quechua (Paccotacya-Yanque et al., 2022)) without any proper examination of how their differences may affect practical applications. Furthermore, existing corpora suffer from data scarcity, restricted access (Siminchik by Cardenas et al., one of the larg…

Original abstract

The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting in 66 hours of recordings for scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper releases the first dedicated speech corpus and ASR benchmark for Puno Quechua, which is a straightforward resource contribution worth documenting.

The main point is that the authors collected 66 hours of Puno Quechua speech (36 hours transcribed and validated) through a participatory campaign, ran a basic benchmark on Whisper, wav2vec2, and XLS-R with and without continued pre-training, and released the data and models. That fills a clear gap for this specific variety.

The participatory approach and open release are the parts that work well. They match the stated goal of community-centered resources and give others something concrete to build on. The claim of being the first systematic benchmark for this variety holds on the paper's own terms since no prior work is cited for it.

The soft spot is the absence of any quantitative details on transcription accuracy, speaker demographics, data splits, or actual model error rates. Without those, it's hard to judge how usable the corpus really is for downstream work or whether the fine-tuning experiments show meaningful gains. The abstract alone leaves the soundness of the benchmark claim thin, though the stress-test note is right that the core existence claim does not require those numbers.

This is for people working on low-resource ASR or language documentation in Andean languages. A reader focused on data collection methods or specific language varieties would find the release useful. It is not a methodological advance, but resource papers like this still matter for preservation.

I would send it to peer review. The data release itself justifies the effort even if the technical results need more detail in revision.

Referee Report

0 major / 1 minor

Summary. The manuscript claims to introduce the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): a 66-hour speech corpus (36 hours manually transcribed and validated) collected via participatory design for scripted and spontaneous speech; the first systematic ASR benchmark evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M with and without continued pre-training; and the open release of all datasets and fine-tuned models.

Significance. If the resources exist as described, the work is significant for under-resourced language preservation through community-centered data collection and open science practices. The participatory design campaign and the open release of datasets and models are clear strengths that enable reproducibility and downstream research on Quechua varieties.

minor comments (1)

[Abstract] The benchmark description states that models were evaluated and fine-tuned but provides no quantitative results, error analysis, or data-split details; adding a brief summary of key metrics would improve reader assessment of the benchmark's scope.
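As an illustration of the kind of data-split detail this comment asks for, one common ASR convention is a speaker-disjoint split, so that no test speaker is heard during training. The sketch below assumes that convention; the abstract does not say how the authors split their data, and the record fields (`id`, `speaker`), the fractions, and the function name are hypothetical.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(utterances, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole speakers to train/dev/test so the sets share no speaker."""
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)

    speakers = sorted(by_speaker)  # sort first so the shuffle is reproducible
    if len(speakers) < 3:
        raise ValueError("need at least three speakers for three disjoint sets")
    random.Random(seed).shuffle(speakers)

    # At least one speaker each for dev and test.
    n_test = max(1, round(len(speakers) * test_frac))
    n_dev = max(1, round(len(speakers) * dev_frac))
    if n_test + n_dev >= len(speakers):
        raise ValueError("fractions leave no speakers for training")

    assignment = {
        "test": speakers[:n_test],
        "dev": speakers[n_test:n_test + n_dev],
        "train": speakers[n_test + n_dev:],
    }
    return {
        name: [utt for spk in spks for utt in by_speaker[spk]]
        for name, spks in assignment.items()
    }


if __name__ == "__main__":
    # Synthetic records: 10 hypothetical speakers with 3 utterances each.
    utts = [{"id": f"u{s}_{k}", "speaker": f"spk{s}"}
            for s in range(10) for k in range(3)]
    splits = speaker_disjoint_split(utts)
    print({name: len(items) for name, items in splits.items()})
```

Reporting the number of speakers and hours per split alongside such a procedure would address most of the comment.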

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review, recognition of the significance of the participatory data collection and open release practices, and recommendation to accept the manuscript.

Circularity Check

0 steps flagged

No significant circularity; empirical resource paper

The paper describes participatory data collection for a 66-hour speech corpus (36 hours transcribed), followed by ASR benchmarking of Whisper, wav2vec2, and XLS-R models with and without continued pre-training, plus open release of datasets and models. No mathematical derivations, predictions, fitted parameters, or uniqueness theorems appear in the abstract or described content. All claims rest on the existence and release of the collected resources and standard fine-tuning/evaluation procedures, which are self-contained empirical steps without reduction to prior self-citations or definitional loops. This matches the default non-circular case for resource papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the value of participatory design for producing usable resources and on the assumption that open release will enable further use; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Participatory design involving community members produces higher-quality and more appropriate language resources than non-participatory methods.
Invoked to justify the data collection campaign described in the abstract.

pith-pipeline@v0.9.1-grok · 5665 in / 1211 out tokens · 42571 ms · 2026-06-29T13:22:51.571003+00:00 · methodology

Reference graph

Works this paper leans on

5 extracted references · 3 canonical work pages

[1] XLS-R: self-supervised cross-lingual speech representation learning at scale. In 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022, pages 2278–2282. ISCA. Ronald Cardenas, Rodolfo Zevallos, Reynaldo Baquerizo, and Luis Camacho

[2] Improving low-resource speech recognition with pretrained speech models: Continued pretraining vs. semi-supervised training. arXiv preprint arXiv:2207.00659. Candace Galla

[3] Omnilingual ASR: open-source multilingual speech recognition for 1600+ languages. CoRR, abs/2511.09690. Hillary Mutisya and John Mugane

[4] Continued pretraining for low-resource Swahili ASR: Achieving state-of-the-art performance with minimal labeled data. arXiv preprint arXiv:2603.11378. Rosa Y. G. Paccotacya-Yanque, Candy A. Huanca-Anquise, Judith Escalante-Calcina, Wilber R. Ramos-Lovón, and Álvaro E. Cuno-Parari

[5] The methodology of participatory design. Technical Communication, 52(2):163–174. Alfredo Torero. 2002. Idiomas de los Andes. Lingüística e historia. Editorial Horizonte. Petti Ulla, M. Claus Hannah, Barford Anna, Sadek Malak, Reichart Roi, and Korhonen Anna
