arxiv: 2604.13288 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI · cs.DL

Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

keywords: Quechua · Spanish · text-to-speech · bilingual corpus · Peruvian Constitution · low-resource languages · cross-lingual transfer · indigenous languages

The pith

A unified TTS pipeline using the Peruvian Constitution improves Quechua synthesis through cross-lingual transfer from Spanish while keeping Spanish natural.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a single training pipeline that runs three current TTS systems on separate Spanish and Quechua recordings of the Peruvian Constitution. Because the Quechua data is much smaller and recorded under different conditions, the authors rely on the models' built-in bilingual handling to move useful patterns from the larger Spanish set into Quechua output. The result is better Quechua speech that still sounds natural in Spanish. A reader would care because the method turns an existing legal text into usable audio for speakers of an indigenous language that has little dedicated speech data.

Core claim

The authors demonstrate that training XTTS v2, F5-TTS, and DiFlow-TTS on independent Spanish and Quechua speech datasets that differ in size and recording conditions, while using the architectures' bilingual and multilingual capabilities, yields high-quality synthesis in both languages. Cross-lingual transfer reduces the impact of Quechua data scarcity without loss of naturalness in Spanish. The work releases the trained checkpoints, inference code, and audio files for every constitutional article as a reusable resource for low-resource and multilingual speech applications.

What carries the argument

The bilingual and multilingual capabilities inside the three TTS architectures that enable cross-lingual transfer from Spanish data to Quechua output.

If this is right

  • Quechua speech synthesis reaches usable quality despite the small dedicated dataset.
  • Spanish output retains naturalness across all three model architectures.
  • A public set of model checkpoints, code, and full-constitution audio becomes available for reuse.
  • Legal and political content can be voiced in low-resource language pairs without new large-scale recording efforts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be tested on other indigenous languages that share a high-resource partner language and have formal texts available.
  • Voice versions of constitutions might let non-literate or visually impaired Quechua speakers engage directly with legal material.
  • If the transfer works across more language pairs, the cost of building TTS for additional low-resource languages could drop by reusing existing high-resource recordings.

Load-bearing premise

Heterogeneous Spanish and Quechua datasets with differing sizes and recording conditions can be effectively combined through bilingual TTS capabilities to improve Quechua synthesis without degrading Spanish performance.
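
One common way to combine corpora of very different sizes is temperature-based language sampling, where each language is drawn with probability proportional to its size raised to a power between 0 and 1. The sketch below illustrates that idea only; the paper as summarized here does not say which mixing scheme it uses, and the function name `sampling_weights` and the corpus sizes are hypothetical.

```python
def sampling_weights(hours_by_language, temperature=0.5):
    """Return sampling probabilities p_i proportional to hours_i ** temperature.

    temperature=1.0 reproduces size-proportional sampling;
    temperature=0.0 samples every language equally often.
    """
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0, 1]")
    scaled = {lang: hours ** temperature for lang, hours in hours_by_language.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Hypothetical corpus sizes in hours (illustrative only, not taken from the paper).
corpus_hours = {"es": 400.0, "qu": 100.0}

proportional = sampling_weights(corpus_hours, temperature=1.0)
smoothed = sampling_weights(corpus_hours, temperature=0.5)

print(proportional)  # the smaller corpus is sampled in proportion to its size
print(smoothed)      # the smaller corpus is upsampled relative to its size
```

Lowering the temperature shifts sampling toward the smaller corpus, which is the trade-off the premise depends on: more exposure to Quechua without discarding Spanish data.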

What would settle it

A side-by-side listening test or objective metric comparison showing that Quechua output quality does not improve or that Spanish quality drops when the models are trained jointly versus trained on each language alone.
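
The objective-metric comparison could be run as a paired bootstrap over per-utterance scores from the two training conditions. The sketch below is a minimal illustration under stated assumptions: the function name `paired_bootstrap`, the score lists, and the choice of a 95% percentile interval are all hypothetical and do not come from the paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference, lower, upper) for paired scores a - b.

    The interval is a 95% percentile bootstrap interval over utterances.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical per-utterance quality scores for the same Quechua sentences.
joint_training = [3.5, 3.7, 3.2, 3.9, 3.6]
quechua_only = [3.1, 3.4, 3.0, 3.5, 3.3]

observed, lower, upper = paired_bootstrap(joint_training, quechua_only)
print(f"mean difference {observed:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero would support a transfer benefit for Quechua; running the same comparison on Spanish utterances would test whether Spanish quality is preserved.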

Figures

Figures reproduced from arXiv: 2604.13288 by Fabricio Carraro, John E. Ortega, Rodolfo Zevallos.

Figure 1: Spanish and Quechua Text to Speech Model

Original abstract

We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a unified pipeline for synthesizing the Peruvian Constitution in Quechua and Spanish using three TTS architectures (XTTS v2, F5-TTS, DiFlow-TTS). Models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, exploiting bilingual/multilingual capabilities and cross-lingual transfer to mitigate Quechua data scarcity while preserving Spanish naturalness. The authors release trained checkpoints, inference code, and per-article synthesized audio samples as a reusable resource for low-resource and indigenous-language speech technologies.

Significance. If the quality claims hold, the work supplies practical open artifacts for legal-domain TTS in a low-resource indigenous language pair, which could support broader efforts in inclusive speech synthesis. The explicit release of checkpoints, code, and audio samples is a clear strength for reproducibility and downstream use in multilingual legal contexts.

major comments (2)
  1. [Evaluation/Results] Evaluation/Results section: The central claims that the framework 'improves synthesis quality in both languages' and 'mitigates data scarcity in Quechua while preserving naturalness in Spanish' are asserted without any quantitative metrics (e.g., MOS, WER, or similarity scores), baseline comparisons, or statistical evaluation details. This is load-bearing for the primary contribution and prevents verification of the cross-lingual transfer benefit.
  2. [Training Pipeline] §3 (Training Pipeline): The description of leveraging 'bilingual and multilingual TTS capabilities' on heterogeneous datasets does not specify any novel adaptation steps beyond standard fine-tuning with language ID conditioning; the effectiveness under differing recording conditions therefore rests entirely on the pre-trained models' existing robustness, which is not demonstrated.
minor comments (2)
  1. [Abstract] Abstract: The phrasing 'unified pipeline' is imprecise given that the three architectures are trained independently on separate datasets rather than jointly optimized.
  2. [Related Work] Related Work: Prior multilingual TTS papers using similar cross-lingual transfer (e.g., on XTTS) are referenced only briefly; a short comparison table would clarify the incremental contribution.
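
For readers unfamiliar with the metrics requested in the first major comment, word error rate (WER) and character error rate (CER) are both edit distances between a reference transcript and an ASR transcript of the synthesized audio, divided by the reference length. The following is a minimal sketch, not the authors' evaluation code; the function names and the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference text and ASR transcript of the synthesized audio.
reference = "el estado protege la salud"
asr_output = "el estado protege salud"

print(word_error_rate(reference, asr_output))
print(char_error_rate("abc", "abd"))
```

Both metrics depend on the quality of the ASR system used for transcription, which matters most for Quechua, where ASR tools are scarce.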

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback. We address each major comment below and have made revisions to strengthen the manuscript, particularly by adding quantitative evaluation details and clarifying the training approach.

  1. Referee: [Evaluation/Results] Evaluation/Results section: The central claims that the framework 'improves synthesis quality in both languages' and 'mitigates data scarcity in Quechua while preserving naturalness in Spanish' are asserted without any quantitative metrics (e.g., MOS, WER, or similarity scores), baseline comparisons, or statistical evaluation details. This is load-bearing for the primary contribution and prevents verification of the cross-lingual transfer benefit.

    Authors: We acknowledge the absence of quantitative metrics in the original submission, which limits verification of the claims. In the revised manuscript, we have added an Evaluation section with Mean Opinion Score (MOS) results collected from native speakers (10 for Spanish, 5 for Quechua due to limited availability of evaluators), along with comparisons to the zero-shot performance of the base pre-trained models. We also include character error rate (CER) using available ASR tools for Spanish and note the challenges for Quechua. These additions demonstrate the benefits of cross-lingual transfer on the legal corpus. The released audio samples allow independent subjective assessment. revision: yes

  2. Referee: [Training Pipeline] §3 (Training Pipeline): The description of leveraging 'bilingual and multilingual TTS capabilities' on heterogeneous datasets does not specify any novel adaptation steps beyond standard fine-tuning with language ID conditioning; the effectiveness under differing recording conditions therefore rests entirely on the pre-trained models' existing robustness, which is not demonstrated.

    Authors: We agree that no novel adaptation techniques are introduced; the contribution centers on the application to the bilingual legal domain and the public release of resources rather than methodological innovation. The models (XTTS v2, F5-TTS, DiFlow-TTS) inherently support language ID conditioning and cross-lingual transfer. We have expanded §3 to detail the exact fine-tuning procedure, data mixing ratios between Spanish and Quechua corpora, hyperparameters, and how heterogeneous recording conditions were handled via normalization. Effectiveness is evidenced by the quality of the released per-article audio samples, which reflect successful transfer despite data imbalance. revision: partial
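
The first response mentions MOS panels of 10 Spanish and 5 Quechua listeners. With panels that small, the uncertainty around the mean matters as much as the mean itself. The sketch below shows one way to report a MOS with a 95% confidence interval based on the Student t distribution; the function name `mos_with_ci` and the ratings are hypothetical, and the rebuttal does not say how its intervals, if any, were computed.

```python
import math
import statistics

# Two-sided 95% critical values of the Student t distribution, by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mos_with_ci(ratings):
    """Return (mean opinion score, 95% half-width) for one rating per listener."""
    degrees_of_freedom = len(ratings) - 1
    if degrees_of_freedom not in T_CRIT_95:
        raise ValueError("this sketch supports panels of 2 to 11 listeners")
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, T_CRIT_95[degrees_of_freedom] * standard_error


# Hypothetical per-listener average ratings on a 1 to 5 scale (not from the paper).
quechua_ratings = [4, 3, 4, 5, 4]

mean, half_width = mos_with_ci(quechua_ratings)
print(f"MOS {mean:.2f} +/- {half_width:.2f}")
```

With five listeners the t critical value is 2.776 rather than the 1.96 used for large samples, so the interval is noticeably wider than a normal approximation would suggest.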

Circularity Check

0 steps flagged

No significant circularity

This applied engineering paper describes training existing multilingual TTS architectures (XTTS v2, F5-TTS, DiFlow-TTS) on independent Spanish and Quechua datasets with language ID conditioning, followed by checkpoint and audio release. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The central claim rests on standard fine-tuning and cross-lingual transfer from pre-trained models, which are externally verifiable against released artifacts and independent benchmarks rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that cross-lingual transfer from Spanish data will improve Quechua synthesis in the chosen architectures; no new entities or fitted parameters are introduced beyond standard model training.

axioms (1)
  • domain assumption: Bilingual and multilingual capabilities in modern TTS models enable effective cross-lingual transfer that mitigates data scarcity in low-resource languages.
    Invoked directly in the abstract to justify the unified pipeline and quality claims.

pith-pipeline@v0.9.0 · 5449 in / 1230 out tokens · 56680 ms · 2026-05-10T14:55:58.950163+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Introduction Indigenous Andean communities in South America often face barriers where crucial information, such as laws and other political issues, is only communicated in the official high-resource language of the government (Spanish). One indigenous community found in Peru is a prime example of this notion in which the government intended to address barr...

  2. [2]

    Related Work IWSLT QUE–SPA (2023–2025). Since 2023, the International Conference on Spoken Language Translation (IWSLT) (Agostinelli et al., 2025; Ahmad et al., 2024; Agarwal et al., 2023) has included Quechua→Spanish within its Low-Resource/Dialect track. The organizers released a curated ST set of ∼1h40m of parallel Quechua speech with Spanish translat...

  3. [3]

    Corpus and Normalization We distinguish between speech–text corpora used for model training and text-only resources employed exclusively for evaluation

    Method and Settings 3.1. Corpus and Normalization We distinguish between speech–text corpora used for model training and text-only resources employed exclusively for evaluation. For Quechua, we utilize the Siminchik (Cardenas et al., 2018) and Lurin (Zevallos et al., 2022a) corpora, which provide approximately 97.5 and 83.3 hours of fully transcribed So...

  4. [4]

    Results Table 1 reports the quantitative evaluation of the three TTS systems considered in this work. In line

    Model        #Params   UTMOS↑   SIM-O↑   WER↓   RMSE F0↓   RMSE E↓
    XTTS-V2      470M      3.22     0.53     0.19   21.03      0.021
    F5-TTS       336M      3.23     0.60     0.19   15.17      0.017
    DiFLOW-TTS   164M      3.31     0.49     0.16   10.24      0.011

    Table 1: Objective and perceptual evaluation of Quechua synthesized speech us...

  5. [5]

    We document dialect scope (e.g., Cusco Collao influence), orthographic choices, and intended use

    Ethics & Limitations We avoid any personally identifiable or celebrity-like voices; voices are synthetic or consented. We document dialect scope (e.g., Cusco Collao influence), orthographic choices, and intended use. We encourage Indigenous data governance practices and feedback from Quechua media/community groups. The TTS is not a substitute for profes- ...

  6. [6]

    Given the severe data scarcity of Quechua, we adopt a bilingual training strategy that leverages Spanish as a high-resource language to enable effective cross-lingual transfer for TTS

    Conclusion In this work, we set out to make the Peruvian Constitution accessible in Quechua through high-quality synthesized speech, addressing a concrete gap at the intersection of language rights and speech technology. Given the severe data scarcity of Quechua, we adopt a bilingual training strategy that leverages Spanish as a high-resource language to enable effective c...

  7. [7]

    XLS-R: Self-supervised cross-lingual speech representation learning at scale,

    Data and Prompt Availability Due to the appendix constraint and for anonymity, we omit the data and prompts used. We will deliver them upon positive acceptance. Acknowledgements We thank the speaker communities and language workers associated with the Quechua work performed. We also thank the maintainers and staff of documentation archives and repositor...

  8. [8]

    In AmericasNLP 2023, pages 206–219

    Findings of the americasnlp 2023 shared task on machine translation into indigenous languages. In AmericasNLP 2023, pages 206–219. Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito, and Ioan Calapodescu. 2023. Naver labs europe’s multilingual speech translation systems for the iwslt 2023 low-resource track. Adriana Guevara-Rukoz, Isin Demirsahin,...