pith. machine review for the scientific record.

arxiv: 2604.25441 · v1 · submitted 2026-04-28 · 💻 cs.SD · cs.CL · eess.AS

Recognition: unknown

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Authors on Pith no claims yet

Pith reviewed 2026-05-07 14:25 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · eess.AS
keywords TTS · Indic TTS · LoRA · voice prompt recovery · phoneme romanization · frozen model · commercial baseline · code-mixing

The pith

A frozen non-Indic TTS base reaches commercial phonological accuracy on Telugu, Tamil and Hindi with romanization, a small LoRA, and prompt overrides.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks what minimum changes let a multilingual base model that lacks native Indic tokenization produce output matching commercial systems on Telugu, Tamil, and Hindi. It shows that a deterministic romanization layer, a LoRA adapter trained only on the text predictor using licensed audio, and a fixed voice-prompt recipe with three sampling changes suffice, without touching the acoustic decoder or any commercial data. This matters because it removes the need for expensive proprietary datasets and full model retraining while still delivering low rates of phoneme collapse and word errors. The result is a practical, licensable route to Indic TTS that keeps the base model frozen.

Core claim

BUPS romanizes seven Indic scripts to ISO-15919 so the base model's Latin tokeniser can process them; a LoRA adapter on only the t3 text-token predictor is trained on roughly 1,220 hours of licensed Indic audio under a Hindi-proxy language ID; and an 8-11 second same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1, called Config B) recovers commercial-class acoustic output. On pilot sets the combination matches or slightly exceeds commercial baselines on retroflex collapse (26.7 % Telugu vs 33.3 %), Tamil zha collapse (71 % vs 86 %), and Hindi LLM-WER (0.025). Hindi uses the vanilla base plus Config B; code-mixed speech uses a third branch, IndicF5 with native-script transliteration.
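The three Config B values are just fixed overrides on default sampling settings. A minimal sketch — the override names come from the paper, but the defaults dict and the merging helper are illustrative, not the released inference API:

```python
# Config B sampling overrides as stated in the paper. How they are wired
# into generation is assumed here; only the three values are from the source.
CONFIG_B = {
    "exaggeration": 0.7,
    "temperature": 0.6,
    "min_p": 0.1,
}

def apply_config_b(defaults: dict) -> dict:
    """Overlay the three Config B overrides on top of default sampling settings."""
    return {**defaults, **CONFIG_B}
```

Any default value for one of the three keys is replaced; unrelated settings pass through untouched.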

What carries the argument

The voice-prompt recovery recipe (same-language 8-11 s reference clip plus Config B sampling overrides) that restores commercial acoustic fidelity without any acoustic-decoder training, supported by BUPS romanization and a targeted LoRA.

If this is right

  • Commercial-class Indic TTS output is achievable from a frozen non-Indic base without acoustic decoder retraining or commercial training data.
  • A two-branch deployment is required: LoRA-adapted path for Telugu and Tamil, vanilla base plus Config B for Hindi.
  • A third IndicF5 branch with native-script transliteration reduces code-mix LLM-WER from 0.80-0.85 to 0.14-0.27.
  • The released R6 LoRA weights, inference code, router, and demo make the recipe immediately reproducible.
  • Phonological metrics on the PSP benchmark are at least as good as the commercial trio while using only licensed data.
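Taken together, the bullets describe a small language router: a code-mix branch, a LoRA branch for Telugu/Tamil, and a vanilla branch for Hindi. A sketch under the paper's stated routing rule (any input with at least one Latin-script word of ≥2 characters goes to the code-mix branch); the branch labels are placeholders, not the released router's API:

```python
import re

# The paper's code-mix trigger: one or more Latin words of >= 2 characters.
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

def route(text: str, lang: str) -> str:
    """Pick an inference branch for a (text, language) pair."""
    if LATIN_WORD.search(text):
        return "code-mix"   # IndicF5 + native-script transliteration
    if lang in ("te", "ta"):
        return "lora"       # BUPS romanisation -> LoRA-adapted t3
    if lang == "hi":
        return "vanilla"    # unchanged base + Config B
    raise ValueError(f"unsupported language: {lang}")
```

Note the code-mix check runs first: a Hindi sentence containing an English brand name routes to the code-mix branch, not the vanilla one.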

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt-level recovery can compensate for missing phoneme coverage in base models without full fine-tuning of the acoustic stack.
  • The same pattern may extend to other languages absent from the base model's tokenizer.
  • Lowering data barriers could speed development of open Indic voice systems that respect licensing constraints.
  • Verification on longer utterances and varied domains is needed before claiming broad stability.

Load-bearing premise

That results observed on 10-utterance pilot sets with the PSP benchmark will hold for real-world speakers, domains, and longer utterances.

What would settle it

A larger, speaker-diverse test set in which Telugu retroflex collapse exceeds 33 %, Tamil zha collapse exceeds 86 %, or Hindi LLM-WER rises clearly above 0.025 would falsify the claim that the method matches or leads commercial performance.
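The falsification condition reduces to three threshold tests against the cited commercial numbers. A sketch, simplifying "rises clearly above" to a plain ≤ comparison and using metric names invented here:

```python
# Commercial bounds cited in the review; metric keys are hypothetical labels.
THRESHOLDS = {
    "telugu_retroflex_collapse": 0.333,  # Sarvam Bulbul
    "tamil_zha_collapse": 0.86,          # commercial trio
    "hindi_llm_wer": 0.025,              # Cartesia Sonic-3
}

def still_competitive(metrics: dict) -> bool:
    """True while every measured rate stays at or below its commercial bound."""
    return all(metrics[k] <= bound for k, bound in THRESHOLDS.items())
```

A larger test set only needs one metric to cross its bound for the matches-or-leads claim to fail.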

Figures

Figures reproduced from arXiv: 2604.25441 by Venkata Pushpak Teja Menta.

Figure 1. Praxy Voice three-branch inference pipeline. Te/Ta pure-script routes through the LoRA branch (BUPS romaniser → LoRA-adapted Chatterbox t3); Hi pure-script routes through the vanilla branch (unchanged Chatterbox t3); both converge on frozen s3gen with the voice-prompt + Config B recipe. Code-mixed inputs (any target language with ≥1 Latin word of ≥2 chars) route through the code-mix branch (§III-E): a Haiku-driven na…
read the original abstract

Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
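The deterministic script-to-ISO-15919 step can be illustrated with a toy character table. The mapping below covers only a few Devanagari consonants with their inherent 'a' and is emphatically not the paper's seven-script BUPS; it only shows why the scheme matters — ISO 15919 keeps the retroflex/dental contrast (ṭ vs t) that the retroflex-collapse metric measures:

```python
# Toy Devanagari -> ISO 15919 consonant mapping (inherent 'a' included).
# Illustrative only; BUPS covers seven scripts and full orthography.
ISO_15919 = {
    "क": "ka",  # velar k
    "ट": "ṭa",  # retroflex t (underdot distinguishes it from dental t)
    "त": "ta",  # dental t
    "ण": "ṇa",  # retroflex n
    "न": "na",  # dental n
}

def romanize(text: str) -> str:
    """Map each covered character; pass anything uncovered through unchanged."""
    return "".join(ISO_15919.get(ch, ch) for ch in text)
```

Because ट and त romanize to distinct Latin strings, a Latin-only tokeniser still sees the contrast a retroflex-collapsing model would otherwise lose.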

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Praxy Voice as a minimal-intervention approach to achieve commercial-class Indic TTS (Telugu, Tamil, Hindi) from the frozen Chatterbox base model. It introduces BUPS for deterministic romanization of Indic scripts to ISO-15919, trains a LoRA adapter solely on the text-token predictor using ~1,220h licensed audio, and employs a voice-prompt recovery recipe with Config B sampling overrides. A two-branch deployment is used (LoRA for Telugu/Tamil, vanilla for Hindi), with a third branch for code-mixing via IndicF5. On 10-utterance PSP benchmark pilots, it reports matching or superior phonological metrics (e.g., lower retroflex/zha collapse rates, low LLM-WER) compared to commercial baselines, without acoustic decoder training or commercial data. The work releases the LoRA weights, inference code, and a demo.

Significance. If validated with larger-scale evaluations, this contribution would be significant for resource-efficient adaptation of multilingual TTS models to Indic languages, demonstrating that targeted text-side adaptation and prompt engineering can close the gap to commercial systems. The open release of weights (Apache-2.0) and code (MIT) enhances reproducibility and is a notable strength. The approach avoids the need for large-scale commercial training data, which is a practical advantage.

major comments (2)
  1. [Evaluation] The central claims of matching or leading commercial baselines (e.g., 26.7% retroflex collapse on Telugu vs. 33.3%, 71% Tamil-zha vs. 86%, 0.025 LLM-WER on Hindi) rest on 10-utterance pilot sets. No error bars, statistical tests, utterance selection protocol, or speaker/domain diversity details are provided, rendering the reported improvements statistically fragile and potentially attributable to sampling noise.
  2. [Deployment Strategy] The post-hoc selection of Config B sampling parameters (exaggeration 0.7, temperature 0.6, min_p 0.1) and the two-branch (LoRA vs. vanilla) deployment for Hindi, where LoRA regresses performance, indicate that the method is not uniformly effective across languages and may require per-language tuning, which undermines the 'minimum intervention' claim.
minor comments (1)
  1. [Abstract] The abstract mentions 'three sampling overrides' but does not specify them until later; including the values (exaggeration 0.7, temperature 0.6, min_p 0.1) in the abstract would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for highlighting the evaluation limitations and deployment considerations. We provide clarifications and partial revisions where appropriate.

read point-by-point responses
  1. Referee: [Evaluation] The central claims of matching or leading commercial baselines (e.g., 26.7% retroflex collapse on Telugu vs. 33.3%, 71% Tamil-zha vs. 86%, 0.025 LLM-WER on Hindi) rest on 10-utterance pilot sets. No error bars, statistical tests, utterance selection protocol, or speaker/domain diversity details are provided, rendering the reported improvements statistically fragile and potentially attributable to sampling noise.

    Authors: The evaluations are explicitly presented as 10-utterance pilots on the PSP benchmark to illustrate the approach's viability. We agree that additional details on utterance selection are needed. In revision, we will add a section describing the protocol: utterances were chosen to include target phonological features (e.g., word-initial and medial retroflexes for Telugu, zha phonemes for Tamil, intra-sentential code-mixes for Hindi) from multiple public-domain speakers and domains (news, stories). Speaker/domain diversity is limited but spans 3-5 speakers per language. We do not provide error bars or statistical tests because the sample size precludes meaningful inference; these results are not claimed to be statistically significant but serve as a proof-of-concept. We will revise the manuscript to emphasize the pilot nature and include the selection details. revision: partial

  2. Referee: [Deployment Strategy] The post-hoc selection of Config B sampling parameters (exaggeration 0.7, temperature 0.6, min_p 0.1) and the two-branch (LoRA vs. vanilla) deployment for Hindi, where LoRA regresses performance, indicate that the method is not uniformly effective across languages and may require per-language tuning, which undermines the 'minimum intervention' claim.

    Authors: Config B was selected once via validation on a small held-out Indic set and then fixed globally; it is not post-hoc per language. The two-branch deployment is a practical choice for optimal performance: the LoRA adapter improves Telugu and Tamil but causes regression on Hindi (likely due to the proxy language_id training), so the vanilla branch is used for Hindi. This is implemented as a simple language router in the released code, without any additional tuning. The minimum intervention claim refers to the training process—no acoustic decoder training, no commercial data, uniform BUPS and prompt recovery recipe. The deployment router does not require per-language hyperparameter search beyond the initial branch decision based on base model support. We disagree that this undermines the claim and make no revision on this point. revision: no

standing simulated objections not resolved
  • The current pilot evaluations on 10 utterances do not allow for robust statistical validation of the claims; expanding to larger test sets would be necessary to fully address concerns about sampling noise and fragility.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's method rests on three independent components: a deterministic BUPS romanization of Indic scripts to ISO-15919, a LoRA adapter trained on ~1,220 h of separately licensed Indic audio, and an empirically stated voice-prompt recipe (8-11 s reference clip plus fixed Config B sampling values) applied to a frozen base model. Claims of matching commercial baselines are supported by direct evaluation on external PSP benchmark pilot sets against named third-party systems (Sarvam Bulbul, Cartesia Sonic-3), with no self-citations, no fitted parameters renamed as predictions, and no equations that reduce outputs to inputs by construction. The derivation chain therefore remains self-contained against external data and benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim depends on the effectiveness of the newly introduced BUPS mapping and voice-prompt recovery procedure, plus the assumption that a small LoRA on the text predictor suffices to adapt phonology without acoustic retraining.

free parameters (2)
  • Config B sampling overrides
    Exaggeration 0.7, temperature 0.6, min_p 0.1 chosen to recover commercial-class acoustics; values appear selected rather than derived.
  • LoRA rank and training schedule
    Specific hyperparameters for the ~1,220 h training run are not stated in the abstract.
axioms (2)
  • domain assumption Deterministic romanization via BUPS preserves all necessary phonological distinctions for the Latin tokeniser
    Invoked to justify processing Telugu, Tamil, and Hindi without script-specific tokeniser changes.
  • domain assumption A LoRA adapter on the text-token predictor alone can correct Indic phonology while leaving the frozen acoustic decoder untouched
    Core premise allowing zero acoustic-decoder training.
invented entities (2)
  • BUPS (Brahmic Unified Phoneme Space) no independent evidence
    purpose: Deterministic romanization of seven Indic scripts to ISO-15919
    Newly defined mapping introduced to enable the base model's tokeniser to handle Indic input.
  • Voice-prompt recovery recipe with Config B no independent evidence
    purpose: Recover commercial acoustic quality using 8-11 s reference clip and fixed sampling overrides
    New procedure proposed to avoid acoustic decoder training.

pith-pipeline@v0.9.0 · 5727 in / 1837 out tokens · 75635 ms · 2026-05-07T14:25:09.680787+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

    cs.SD 2026-05 unverdicted novelty 6.0

    LASE eliminates the script-induced drop in speaker similarity (from 0.08-0.1 down to near zero) by training a language-adversarial projection head on top of frozen WavLM using synthesized cross-script pairs.

Reference graph

Works this paper leans on

20 extracted references · 5 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1] P. Teja, "PSP: An interpretable per-dimension accent benchmark for Indic text-to-speech," arXiv preprint, 2026. Companion paper submitted separately.

  2. [2] AI4Bharat and Y. Lacombe, "Indic Parler-TTS: Open-source TTS for 20 Indic languages," https://huggingface.co/ai4bharat/indic-parler-tts, 2024.

  3. [3] AI4Bharat, "IndicF5: Flow-matching TTS for 11 Indic languages," https://huggingface.co/ai4bharat/IndicF5, 2024.

  4. [4] k2-fsa Team, "OmniVoice: Towards omnilingual zero-shot text-to-speech with diffusion language models," arXiv:2604.00688, 2026, model at https://huggingface.co/k2-fsa/OmniVoice.

  5. [5] A. S. Bhadoriya et al., "A2TTS: TTS for low-resource Indian languages," arXiv:2507.15272, 2025.

  6. [6] Resemble AI, "Chatterbox multilingual: Open-source TTS for 23 languages," https://github.com/resemble-ai/chatterbox, 2025, model weights at https://huggingface.co/ResembleAI/chatterbox, accessed 2026-04.

  7. [7] International Organization for Standardization, "ISO 15919: Transliteration of Devanagari and related Indic scripts into Latin characters," Standards document, 2001.

  8. [8] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in ICLR, 2022, arXiv:2106.09685.

  9. [9] OpenBMB, "VoxCPM2: Tokeniser-free TTS for multilingual speech generation," https://github.com/OpenBMB/VoxCPM, 2025.

  10. [10] M. contributors, "indic-transliteration: Python package for Indic script transliteration," https://github.com/indic-transliteration/indic_transliteration_py, 2024.

  11. [11] Y. Chen et al., "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," arXiv:2410.06885, 2024.

  12. [12] T. Lertpetchpun et al., "Quantifying speaker embedding phonological rule interactions in accented speech synthesis," in IEEE ICASSP, 2026, arXiv:2601.14417.

  13. [13] HuggingFace, "PEFT: State-of-the-art parameter-efficient fine-tuning methods," https://github.com/huggingface/peft, 2024.

  14. [14] Ahmed-Ezzat20, "Chatterbox fine-tuning multilingual," https://github.com/Ahmed-Ezzat20/chatterbox-finetuning-multilingual, 2025.

  15. [15] G. Kumar et al., "Towards building text-to-speech systems for the next billion users," in ICASSP, 2023.

  16. [16] A. Sankar et al., "Rasmalai: A large-scale Indic speech dataset with accent and intonation descriptions," in Interspeech, 2025.

  17. [17] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, "FLEURS: Few-shot learning evaluation of universal representations of speech," in SLT, 2022.

  18. [18] Sarvam AI, "Bulbul-v3: An Indic multilingual TTS with unified native-script tokenisation," https://www.sarvam.ai/blogs/bulbul-v3, Feb. 2026, technical blog post.

  19. [19] A. Babu, C. Wang, A. Tjandra et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," Interspeech, 2022.

  20. [20] Karya Team, "Karya: Crowdsourcing platform for Indian languages," https://karya.in, 2025.