Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost
Pith reviewed 2026-05-07 14:25 UTC · model grok-4.3
The pith
A frozen non-Indic TTS base reaches commercial phonological accuracy on Telugu, Tamil and Hindi with romanization, a small LoRA, and prompt overrides.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BUPS romanizes seven Indic scripts to ISO-15919 so the base model's Latin tokeniser can process them; a LoRA adapter on only the t3 text-token predictor is trained on roughly 1,220 hours of licensed Indic audio under a Hindi-proxy language ID; and an 8-11 second same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1, together "Config B") recovers commercial-class acoustic output. On pilot sets the combination matches or slightly exceeds commercial baselines on retroflex collapse (26.7 % Telugu vs 33.3 %), Tamil zha collapse (71 % vs 86 %), and Hindi LLM-WER (0.025). Hindi uses the vanilla base plus Config B; code-mixed speech uses a third branch, IndicF5 with native-script transliteration.
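The released BUPS tables are not reproduced in this review, but the core mechanism (a deterministic script-to-ISO-15919 mapping so a Latin tokeniser sees romanised text) can be illustrated with a toy Devanagari fragment. The character tables below cover only a handful of signs and are an illustrative assumption, not the actual BUPS mapping:

```python
# Toy deterministic Devanagari -> ISO 15919 romanizer (illustrative only).
# Brahmic consonants carry an inherent 'a' unless a vowel sign or virama follows.
CONSONANTS = {"क": "k", "न": "n", "म": "m", "स": "s", "त": "t", "ट": "ṭ"}
VOWEL_SIGNS = {"ा": "ā", "ि": "i", "े": "ē"}
INDEP_VOWELS = {"अ": "a", "आ": "ā"}
VIRAMA = "्"  # suppresses the inherent vowel

def romanize(text: str) -> str:
    out = []
    pending_a = False  # inherent 'a' waiting after the last consonant
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # replaces the inherent vowel
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False            # bare consonant, no vowel
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(INDEP_VOWELS.get(ch, ch))
    if pending_a:
        out.append("a")
    return "".join(out)

print(romanize("नमस्ते"))  # → namastē
```

The point of the deterministic scheme is that the frozen tokeniser never sees an unknown script: every Brahmic character lands on a fixed ISO-15919 Latin sequence.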
What carries the argument
The voice-prompt recovery recipe (same-language 8-11 s reference clip plus Config B sampling overrides) that restores commercial acoustic fidelity without any acoustic-decoder training, supported by BUPS romanization and a targeted LoRA.
If this is right
- Commercial-class Indic TTS output is achievable from a frozen non-Indic base without acoustic decoder retraining or commercial training data.
- A two-branch deployment is required: LoRA-adapted path for Telugu and Tamil, vanilla base plus Config B for Hindi.
- A third IndicF5 branch with native-script transliteration reduces code-mix LLM-WER from 0.80-0.85 to 0.14-0.27.
- The released R6 LoRA weights, inference code, router, and demo make the recipe immediately reproducible.
- Phonological metrics on the PSP benchmark are at least as good as the commercial trio while using only licensed data.
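Taken together, the deployment these bullets describe is a three-way routing decision plus one fixed sampling config. A minimal sketch, assuming two-letter language codes and illustrative branch labels (the released MIT router's actual interface is not shown in this review); only the Config B values are taken from the paper:

```python
# Config B values are quoted from the paper; everything else here
# (function name, branch labels, language codes) is illustrative.
CONFIG_B = {"exaggeration": 0.7, "temperature": 0.6, "min_p": 0.1}

def choose_branch(language: str, code_mixed: bool) -> str:
    """Pick one of the three deployment branches described above."""
    if code_mixed:
        return "indicf5_native_script"   # third branch: IndicF5 + native-script transliteration
    if language in ("te", "ta"):
        return "chatterbox_lora_bups"    # R6 LoRA path with BUPS-romanised input
    if language == "hi":
        return "chatterbox_vanilla"      # vanilla base + Config B (the LoRA regresses Hindi)
    raise ValueError(f"no branch for language {language!r}")
```

Note that Config B is applied once, globally, at sampling time; the only per-language decision is which frozen-base path the request takes.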
Where Pith is reading between the lines
- Prompt-level recovery can compensate for missing phoneme coverage in base models without full fine-tuning of the acoustic stack.
- The same pattern may extend to other languages absent from the base model's tokenizer.
- Lowering data barriers could speed development of open Indic voice systems that respect licensing constraints.
- Verification on longer utterances and varied domains is needed before claiming broad stability.
Load-bearing premise
That results observed on 10-utterance pilot sets with the PSP benchmark will hold for real-world speakers, domains, and longer utterances.
What would settle it
A larger, speaker-diverse test set in which Telugu retroflex collapse exceeds 33 %, Tamil zha collapse exceeds 86 %, or Hindi LLM-WER rises clearly above 0.025 would falsify the claim that the method matches or leads commercial performance.
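Stated as a decision rule, that falsification criterion can be written down directly. The thresholds come from the review text; the metric keys and the `margin` argument (to capture "rises clearly above") are illustrative assumptions:

```python
# Commercial-baseline figures quoted in the review; exceeding any of them
# on a larger, speaker-diverse test set would falsify the parity claim.
THRESHOLDS = {
    "te_retroflex_collapse": 0.333,  # Sarvam Bulbul
    "ta_zha_collapse": 0.86,         # commercial trio
    "hi_llm_wer": 0.025,             # tied with Cartesia Sonic-3
}

def falsifying_metrics(measured: dict, margin: float = 0.0) -> list:
    """Return the metrics whose measured value exceeds its baseline by more than margin."""
    return sorted(k for k, v in measured.items() if v > THRESHOLDS[k] + margin)

falsifying_metrics({"te_retroflex_collapse": 0.40})  # flags the Telugu claim
```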
Original abstract
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Praxy Voice as a minimal-intervention approach to achieve commercial-class Indic TTS (Telugu, Tamil, Hindi) from the frozen Chatterbox base model. It introduces BUPS for deterministic romanization of Indic scripts to ISO-15919, trains a LoRA adapter solely on the text-token predictor using ~1,220h licensed audio, and employs a voice-prompt recovery recipe with Config B sampling overrides. A two-branch deployment is used (LoRA for Telugu/Tamil, vanilla for Hindi), with a third branch for code-mixing via IndicF5. On 10-utterance PSP benchmark pilots, it reports matching or superior phonological metrics (e.g., lower retroflex/zha collapse rates, low LLM-WER) compared to commercial baselines, without acoustic decoder training or commercial data. The work releases the LoRA weights, inference code, and a demo.
Significance. If validated with larger-scale evaluations, this contribution would be significant for resource-efficient adaptation of multilingual TTS models to Indic languages, demonstrating that targeted text-side adaptation and prompt engineering can close the gap to commercial systems. The open release of weights (Apache-2.0) and code (MIT) enhances reproducibility and is a notable strength. The approach avoids the need for large-scale commercial training data, which is a practical advantage.
major comments (2)
- [Evaluation] The central claims of matching or leading commercial baselines (e.g., 26.7% retroflex collapse on Telugu vs. 33.3%, 71% Tamil-zha vs. 86%, 0.025 LLM-WER on Hindi) rest on 10-utterance pilot sets. No error bars, statistical tests, utterance selection protocol, or speaker/domain diversity details are provided, rendering the reported improvements statistically fragile and potentially attributable to sampling noise.
- [Deployment Strategy] The post-hoc selection of Config B sampling parameters (exaggeration 0.7, temperature 0.6, min_p 0.1) and the two-branch (LoRA vs. vanilla) deployment for Hindi, where LoRA regresses performance, indicate that the method is not uniformly effective across languages and may require per-language tuning, which undermines the 'minimum intervention' claim.
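The fragility objection can be made concrete with a back-of-the-envelope interval. Assuming, purely for illustration, that the 26.7 % Telugu figure came from 4 collapses in 15 scored events (the paper does not state the denominator), a 95 % Wilson score interval is wide enough to contain the 33.3 % commercial baseline:

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)

lo, hi = wilson_interval(4 / 15, 15)  # roughly (0.11, 0.52)
```

The interval spans roughly 11 % to 52 %, comfortably containing the 33.3 % baseline: at this sample size the reported improvement is statistically indistinguishable from noise, which is exactly the referee's point.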
minor comments (1)
- [Abstract] The abstract mentions 'three sampling overrides' but does not specify them until later; including the values (exaggeration 0.7, temperature 0.6, min_p 0.1) in the abstract would improve clarity.
Simulated Author's Rebuttal
We thank the referee for highlighting the evaluation limitations and deployment considerations. We provide clarifications and partial revisions where appropriate.
Point-by-point responses
- Referee: [Evaluation] The central claims of matching or leading commercial baselines (e.g., 26.7% retroflex collapse on Telugu vs. 33.3%, 71% Tamil-zha vs. 86%, 0.025 LLM-WER on Hindi) rest on 10-utterance pilot sets. No error bars, statistical tests, utterance selection protocol, or speaker/domain diversity details are provided, rendering the reported improvements statistically fragile and potentially attributable to sampling noise.
Authors: The evaluations are explicitly presented as 10-utterance pilots on the PSP benchmark to illustrate the approach's viability. We agree that additional details on utterance selection are needed. In revision, we will add a section describing the protocol: utterances were chosen to include target phonological features (e.g., word-initial and medial retroflexes for Telugu, zha phonemes for Tamil, intra-sentential code-mixes for Hindi) from multiple public-domain speakers and domains (news, stories). Speaker/domain diversity is limited but spans 3-5 speakers per language. We do not provide error bars or statistical tests because the sample size precludes meaningful inference; these results are not claimed to be statistically significant but serve as a proof-of-concept. We will revise the manuscript to emphasize the pilot nature and include the selection details. revision: partial
- Referee: [Deployment Strategy] The post-hoc selection of Config B sampling parameters (exaggeration 0.7, temperature 0.6, min_p 0.1) and the two-branch (LoRA vs. vanilla) deployment for Hindi, where LoRA regresses performance, indicate that the method is not uniformly effective across languages and may require per-language tuning, which undermines the 'minimum intervention' claim.
Authors: Config B was selected once via validation on a small held-out Indic set and then fixed globally; it is not post-hoc per language. The two-branch deployment is a practical choice for optimal performance: the LoRA adapter improves Telugu and Tamil but causes regression on Hindi (likely due to the proxy language_id training), so the vanilla branch is used for Hindi. This is implemented as a simple language router in the released code, without any additional tuning. The minimum intervention claim refers to the training process—no acoustic decoder training, no commercial data, uniform BUPS and prompt recovery recipe. The deployment router does not require per-language hyperparameter search beyond the initial branch decision based on base model support. We disagree that this undermines the claim and make no revision on this point. revision: no
- The current pilot evaluations on 10 utterances do not allow for robust statistical validation of the claims; expanding to larger test sets would be necessary to fully address concerns about sampling noise and fragility.
Circularity Check
No significant circularity detected
Full rationale
The paper's method rests on three independent components: a deterministic BUPS romanization of Indic scripts to ISO-15919, a LoRA adapter trained on ~1,220 h of separately licensed Indic audio, and an empirically stated voice-prompt recipe (8-11 s reference clip plus fixed Config B sampling values) applied to a frozen base model. Claims of matching commercial baselines are supported by direct evaluation on external PSP benchmark pilot sets against named third-party systems (Sarvam Bulbul, Cartesia Sonic-3), with no self-citations, no fitted parameters renamed as predictions, and no equations that reduce outputs to inputs by construction. The derivation chain therefore remains self-contained against external data and benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- Config B sampling overrides
- LoRA rank and training schedule
axioms (2)
- domain assumption Deterministic romanization via BUPS preserves all necessary phonological distinctions for the Latin tokeniser
- domain assumption A LoRA adapter on the text-token predictor alone can correct Indic phonology while leaving the frozen acoustic decoder untouched
invented entities (2)
- BUPS (Brahmic Unified Phoneme Space): no independent evidence
- Voice-prompt recovery recipe with Config B: no independent evidence
Forward citations
Cited by 1 Pith paper
- LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation. LASE eliminates the script-induced drop in speaker similarity (from 0.08-0.1 down to near zero) by training a language-adversarial projection head on top of frozen WavLM using synthesized cross-script pairs.
Reference graph
Works this paper leans on
- [1] P. Teja, "PSP: An interpretable per-dimension accent benchmark for Indic text-to-speech," arXiv preprint, 2026. Companion paper submitted separately.
- [2] AI4Bharat and Y. Lacombe, "Indic Parler-TTS: Open-source TTS for 20 Indic languages," https://huggingface.co/ai4bharat/indic-parler-tts, 2024.
- [3] AI4Bharat, "IndicF5: Flow-matching TTS for 11 Indic languages," https://huggingface.co/ai4bharat/IndicF5, 2024.
- [4] k2-fsa Team, "OmniVoice: Towards omnilingual zero-shot text-to-speech with diffusion language models," arXiv:2604.00688, 2026. Model at https://huggingface.co/k2-fsa/OmniVoice.
- [5] A. S. Bhadoriya et al., "A2TTS: TTS for low-resource Indian languages," arXiv:2507.15272, 2025.
- [6] Resemble AI, "Chatterbox multilingual: Open-source TTS for 23 languages," https://github.com/resemble-ai/chatterbox, 2025. Model weights at https://huggingface.co/ResembleAI/chatterbox, accessed 2026-04.
- [7] International Organization for Standardization, "ISO 15919: Transliteration of Devanagari and related Indic scripts into Latin characters," standards document, 2001.
- [8] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in ICLR, 2022. arXiv:2106.09685.
- [9] OpenBMB, "VoxCPM2: Tokeniser-free TTS for multilingual speech generation," https://github.com/OpenBMB/VoxCPM, 2025.
- [10] indic-transliteration contributors, "indic-transliteration: Python package for Indic script transliteration," https://github.com/indic-transliteration/indic_transliteration_py, 2024.
- [11] Y. Chen et al., "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," arXiv:2410.06885, 2024.
- [12] T. Lertpetchpun et al., "Quantifying speaker embedding phonological rule interactions in accented speech synthesis," in IEEE ICASSP, 2026. arXiv:2601.14417.
- [13] HuggingFace, "PEFT: State-of-the-art parameter-efficient fine-tuning methods," https://github.com/huggingface/peft, 2024.
- [14] Ahmed-Ezzat20, "Chatterbox fine-tuning multilingual," https://github.com/Ahmed-Ezzat20/chatterbox-finetuning-multilingual, 2025.
- [15] G. Kumar et al., "Towards building text-to-speech systems for the next billion users," in ICASSP, 2023.
- [16] A. Sankar et al., "Rasmalai: A large-scale Indic speech dataset with accent and intonation descriptions," in Interspeech, 2025.
- [17] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, "FLEURS: Few-shot learning evaluation of universal representations of speech," in SLT, 2022.
- [18] Sarvam AI, "Bulbul-v3: An Indic multilingual TTS with unified native-script tokenisation," https://www.sarvam.ai/blogs/bulbul-v3, Feb. 2026. Technical blog post.
- [19] A. Babu, C. Wang, A. Tjandra et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," in Interspeech, 2022.
- [20] Karya Team, "Karya: Crowdsourcing platform for Indian languages," https://karya.in, 2025.