The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Venkata Pushpak Teja Menta

arxiv: 2605.03073 · v1 · submitted 2026-05-04 · 💻 cs.CL · cs.SD

Pith reviewed 2026-05-08 18:17 UTC · model grok-4.3

keywords: Indic ASR · synthetic data · TTS-STT flywheel · entity hit rate · code-mixing · LoRA fine-tuning · Whisper model · script collapse

The pith

A TTS-STT flywheel using 22,000 synthetic Indic utterances raises entity hit rate from 0.027 to 0.473 on held-out tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that both open-source and commercial ASR systems perform poorly on Indic speech containing digits, currency, addresses, brands, and English-Indic code-mixing. It establishes that an open-source TTS pipeline can generate a large set of entity-dense synthetic utterances at minimal cost and that fine-tuning a base model on this data produces large gains in entity recognition. The resulting models maintain acceptable performance on standard read-prose benchmarks and show partial transfer to native recordings. An ablation isolates the synthetic corpus as the source of the gains, while honest reporting notes that pre-registered performance targets were not fully reached.

Core claim

The self-contained TTS-STT flywheel synthesizes roughly 22,000 entity-dense Indic-English code-mix utterances at under $50 marginal cost; LoRA fine-tuning on top of vasista22 then delivers Entity-Hit-Rate 0.473 on a held-out Telugu test set (17 times open SOTA, 3 times commercial), with cross-language results of 0.337 for Hindi and 0.543 for Tamil, limited read-prose regression, and confirmed transfer on a small native Telugu recording set.
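
The "17 times" and "3 times" multipliers can be checked directly from the three EHR values quoted in the abstract. The snippet below is only an arithmetic sanity check; the helper name `gain_ratio` is illustrative and not taken from the paper's code.

```python
# EHR values on the Telugu held-out set, as quoted in the abstract.
EHR_FLYWHEEL = 0.473    # LoRA fine-tune on top of vasista22
EHR_OPEN_SOTA = 0.027   # vasista22/whisper-telugu-large-v2
EHR_COMMERCIAL = 0.16   # Deepgram Nova-3


def gain_ratio(new: float, baseline: float) -> float:
    """Multiplicative gain of `new` over `baseline` (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return new / baseline


ratio_open = gain_ratio(EHR_FLYWHEEL, EHR_OPEN_SOTA)
ratio_commercial = gain_ratio(EHR_FLYWHEEL, EHR_COMMERCIAL)

print(f"vs open SOTA:  {ratio_open:.1f}x")
print(f"vs commercial: {ratio_commercial:.1f}x")
```

The large multiplier over the open baseline comes mostly from how low that baseline is (0.027), which is why the absolute EHR of 0.473 is worth reading alongside the ratio.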

What carries the argument

The TTS<->STT flywheel that generates an Entity-Dense Synthetic Audio (EDSA) corpus for subsequent LoRA fine-tuning of a base ASR model.
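
For readers unfamiliar with LoRA, the following is a minimal pure-Python sketch of the general low-rank update it relies on: a frozen weight matrix W is adjusted by a scaled product of two small trainable matrices, W' = W + (alpha / r) * B A. It is not the paper's training code; the dimensions, rank, and alpha below are hypothetical, and the paper's actual hyperparameters are not quoted on this page.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w: frozen base weight (d_out x d_in)
    b: trainable matrix (d_out x rank), conventionally initialised to zero
    a: trainable matrix (rank x d_in)
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_fraction(d_out, d_in, rank):
    """Share of parameters that LoRA trains for one d_out x d_in matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


random.seed(0)
d_out, d_in, rank, alpha = 6, 4, 2, 4.0  # hypothetical toy sizes
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op

w_eff = lora_effective_weight(w, a, b, alpha, rank)
print("adapter is a no-op at init:", w_eff == w)

# Illustration only: a 1280 x 1280 projection with a hypothetical rank of 8.
print(f"trainable share: {trainable_fraction(1280, 1280, 8):.4f}")
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and only the small A and B matrices are updated during fine-tuning.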

If this is right

The EDSA ablation attributes essentially 100 percent of the EHR gain to the synthetic corpus rather than the base data.
Per-language LoRA corrects Telugu script collapse in vanilla Whisper-large-v3 while the correction is unnecessary for Hindi and Tamil where vanilla performance is already high.
Cross-language beta models show EHR gains over vasista22 of 7x on Hindi (0.337) and 22x on Tamil (0.543), but both fall short of the pre-registered 0.65 target.
On Hindi the flywheel underperforms a commercial system that already covers entities well, while on Telugu and Tamil it exceeds both open and commercial baselines.
Native Telugu sanity check (n=20) produces EHR 0.516, confirming that synthetic training improves recognition on real speech.
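
One plausible way to read the "essentially 100 percent" attribution in the first bullet is as the share of the total EHR gain that disappears when the EDSA corpus is removed from training. The formula below is an assumption made for illustration; the paper's exact attribution formula is not quoted on this page. The three EHR values are the ones given in the abstract.

```python
# EHR values on the Telugu held-out set, as quoted in the abstract.
EHR_BASELINE = 0.027   # vasista22, no additional fine-tuning
EHR_FULL = 0.473       # LoRA fine-tune with the EDSA corpus
EHR_ABLATION = 0.020   # LoRA on FLEURS-Te alone (no EDSA)


def attribution_share(full: float, ablation: float, baseline: float) -> float:
    """Assumed definition: (full - ablation) / (full - baseline)."""
    total_gain = full - baseline
    if total_gain <= 0:
        raise ValueError("no gain to attribute")
    return (full - ablation) / total_gain


share = attribution_share(EHR_FULL, EHR_ABLATION, EHR_BASELINE)
print(f"share of gain attributed to EDSA: {share:.1%}")
```

Under this reading the share comes out marginally above 100 percent, because the FLEURS-only LoRA (0.020) scores slightly below the untouched baseline (0.027).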

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Iterating the flywheel by feeding improved ASR outputs back into TTS generation could produce higher-quality training data at still low cost.
The same low-cost synthetic pipeline could be applied to other low-resource languages or domains that suffer from entity recognition gaps.
Releasing the EDSA corpus and entity dictionaries enables independent verification and extension by other groups without requiring new recordings.

Load-bearing premise

The synthetic utterances produced by the TTS pipeline match real-world entity-dense Indic speech closely enough in acoustics and linguistics for the fine-tuned model to transfer effectively to native recordings.

What would settle it

A test of the fine-tuned beta-Te model on several hundred human-recorded entity-dense Telugu utterances that yields an EHR below 0.35 would indicate that the synthetic data does not transfer adequately.
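
To see why "several hundred" recordings matter, the sketch below computes a 95% Wilson score interval around the reported native EHR for two sample sizes. It rests on a strong simplifying assumption that is not from the paper: each utterance is treated as one independent hit-or-miss trial, whereas EHR is presumably computed over entities. The n=300 case is hypothetical and simply reuses the same point estimate.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


THRESHOLD = 0.35        # failure threshold proposed in the text above
EHR_NATIVE = 0.516      # reported on the n=20 native Telugu set

lo20, hi20 = wilson_interval(EHR_NATIVE, 20)
lo300, hi300 = wilson_interval(EHR_NATIVE, 300)  # hypothetical larger test

print(f"n=20:  [{lo20:.3f}, {hi20:.3f}]")
print(f"n=300: [{lo300:.3f}, {hi300:.3f}]")
print("n=20 interval includes threshold:", lo20 < THRESHOLD < hi20)
```

Under these assumptions the n=20 interval is wide enough to contain the 0.35 threshold, so the existing check cannot rule out inadequate transfer, while a test of a few hundred utterances could.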

Figures

Figures reproduced from arXiv: 2605.03073 by Venkata Pushpak Teja Menta.

**Figure 1.** Entity-Hit-Rate on the entity-dense Telugu held-out set (n = 102). Praxy-STT-rb closes 17× the gap over open-source SOTA and 3× over commercial. […] and proper_nouns classes (held-out distribution did not contain rows in those classes after class-balancing); these are reported as “—” rather than 0 to avoid implying a system failure on classes that were never tested.

**Figure 2.** Per-language Script Fidelity Rate on CV25, across vanilla Whisper-v3, Praxy-STT-r2 (Whisper-v3 + per-language LoRA), and vasista22 (open SOTA). Vanilla v3 collapses on Telugu only; the LoRA recipe fixes Te but harms Hi/Ta; vasista22 sits at ≈ 1.0 across all three. […] by 20–160% relative (+19 to +69 pp absolute) and drops SFR to as low as 0.43 (Hi-IV). The recipe is therefore contraindicated outside Telugu, an…

Original abstract

Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic codemix -- is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held-out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves Entity-Hit-Rate (EHR) 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test (17x over open SOTA, 3x over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language: beta-Hi 0.337 (7x vs vasista22) and beta-Ta 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi where Deepgram has substantial entity coverage, the flywheel underperforms commercial. All three beta models fall below pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta); we report honestly. A native-human-recorded sanity check (n=20 Telugu) confirms transfer to real speech (beta-Te EHR 0.516 on native vs 0.473 on synth). An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 has Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated on Hindi and Tamil where vanilla SFR >= 0.98. Code, holdouts, predictions, EDSA corpus, and entity dictionaries are released open-source.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows a cheap TTS-STT loop that lifts entity hit rates on synthetic Indic code-mix tests by a lot and releases the data, but the real-speech transfer rests on just 20 utterances.

The core contribution is a self-contained flywheel that generates roughly 22k entity-dense Indic-English code-mix utterances for under $50 and then LoRA-fine-tunes vasista22, an open Whisper fine-tune, to reach EHR 0.473 on held-out synthetic Telugu (versus 0.027 for the open baseline). Cross-language betas show similar jumps, an ablation credits the new corpus for nearly all the gain, and the author documents a Telugu-specific script collapse in vanilla Whisper-large-v3 that a per-language adapter fixes. Everything (corpus, code, predictions, entity lists) is released, which is the part that actually moves the needle for follow-up work.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that a self-contained TTS-STT flywheel (an open-source Indic TTS pipeline generating ~22,000 entity-dense Indic-English code-mix utterances at <$50 cost), combined with LoRA fine-tuning on vasista22, closes the niche-domain Indic ASR gap. It reports EHR 0.473 on a held-out synthetic Telugu test set (17x over open SOTA vasista22 at 0.027, 3x over Deepgram Nova-3 at 0.16), cross-language gains (beta-Hi 0.337, beta-Ta 0.543), read-prose regression bounded to +6.6 pp WER on FLEURS-Te, an EDSA-isolation ablation attributing ~100% of gains to the synthetic corpus, and an n=20 native Telugu sanity check (EHR 0.516). All beta models miss pre-registered EHR targets (0.75 Te, 0.65 Hi/Ta), and the misses are reported honestly; a language-conditional script-collapse finding for Whisper-large-v3 is also noted. Code, holdouts, predictions, the EDSA corpus, and entity dictionaries are released.

Significance. If the synthetic-to-real transfer generalizes, the work supplies a low-cost, reproducible recipe for improving entity-dense ASR in under-served Indic languages, with the open release of the code, EDSA corpus, and entity dictionaries constituting a clear strength. The EDSA-isolation ablation provides transparent evidence that gains derive from the synthetic data rather than other factors. The honest reporting of missed pre-registered targets and the language-conditional script-collapse observation add credibility, though the overall significance for real-world deployment remains conditional on stronger validation of transfer.

major comments (3)

[native-human-recorded sanity check] Native-human-recorded sanity check (n=20 Telugu): the claim of confirmed transfer to real speech rests on EHR 0.516 for 20 utterances versus 0.473 on synthetic. This sample cannot capture speaker/accent variation, prosody, background noise, or the full distribution of entity-dense code-mixed speech, rendering the generalization to native recordings load-bearing yet weakly supported.
[abstract and results reporting] Pre-registered EHR targets: all beta models fall below the stated targets (0.75 for Te, 0.65 for Hi/Ta), yet the abstract and title frame the work as closing the gap. This discrepancy requires explicit discussion of why the targets were missed and how the reported gains should be interpreted relative to the original goals.
[test set and EDSA corpus description] Held-out test construction and synthesis fidelity: main results (including the 17x/3x gains and EDSA ablation) are measured on synthetic data generated by the same TTS pipeline used for training. Without detailed verification that the held-out set is independent and that synthesis fidelity matches real acoustic/linguistic properties, the headline improvements risk over-optimism.

minor comments (2)

[methods] Define Entity-Hit-Rate (EHR) and Script Fidelity Rate (SFR) explicitly with formulas or pseudocode in the methods section for reproducibility.
[cross-language results] The cross-language Hindi result (underperformance vs. Deepgram) is noted but would benefit from a short paragraph analyzing why the flywheel succeeds on Telugu/Tamil yet not Hindi.
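
To make the first minor comment concrete, here is one hypothetical way the two metrics could be written down. These definitions are assumptions for illustration, not the paper's: EHR is taken as the fraction of reference entity strings found in the hypothesis after case and whitespace normalization, and SFR as the fraction of letter and combining-mark characters in the hypothesis that fall inside the target script's Unicode block.

```python
import unicodedata

# Unicode block ranges (inclusive) for the three scripts discussed.
SCRIPT_RANGES = {
    "te": (0x0C00, 0x0C7F),  # Telugu
    "hi": (0x0900, 0x097F),  # Devanagari
    "ta": (0x0B80, 0x0BFF),  # Tamil
}


def _normalize(text: str) -> str:
    """Casefold and collapse runs of whitespace."""
    return " ".join(text.casefold().split())


def entity_hit_rate(reference_entities, hypothesis: str) -> float:
    """Assumed EHR: share of reference entities present in the hypothesis."""
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    hyp = _normalize(hypothesis)
    hits = sum(1 for entity in reference_entities if _normalize(entity) in hyp)
    return hits / len(reference_entities)


def script_fidelity_rate(hypothesis: str, lang: str) -> float:
    """Assumed SFR: share of letters/marks that are in the target script."""
    lo, hi = SCRIPT_RANGES[lang]
    letters = [c for c in hypothesis if unicodedata.category(c)[0] in ("L", "M")]
    if not letters:
        return 0.0  # no script evidence at all; treated as a failure here
    in_script = sum(1 for c in letters if lo <= ord(c) <= hi)
    return in_script / len(letters)


# The word "Telugu" written in Telugu script, spelled with escapes for safety.
telugu_word = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"

print(entity_hit_rate(["Flipkart", "Hyderabad"], "ordered from flipkart today"))
print(script_fidelity_rate(telugu_word, "te"))
print(script_fidelity_rate("telugu", "te"))
```

Substring matching is deliberately crude (for example, "500" would match inside "1500"), and a code-mix-aware SFR would need to decide whether legitimately English words in Latin script count as failures. Choices like these are exactly what an explicit definition in the methods section would pin down.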

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and limitations of our evaluation. We address each major point below with proposed revisions to enhance transparency and accuracy in the manuscript.

Referee: [native-human-recorded sanity check] Native-human-recorded sanity check (n=20 Telugu): the claim of confirmed transfer to real speech rests on EHR 0.516 for 20 utterances versus 0.473 on synthetic. This sample cannot capture speaker/accent variation, prosody, background noise, or the full distribution of entity-dense code-mixed speech, rendering the generalization to native recordings load-bearing yet weakly supported.

Authors: We agree that the n=20 sample is too small to capture speaker, accent, prosody, or noise variation and cannot serve as robust evidence of generalization. It was included only as a preliminary sanity check to demonstrate that the model does not catastrophically fail on real speech. In the revised manuscript we will (1) explicitly label it as a limited sanity check rather than confirmation of transfer, (2) add a dedicated limitations paragraph quantifying the sample-size constraint, and (3) state that larger-scale real-world collection remains necessary for deployment claims. The EDSA-isolation ablation and synthetic held-out results will remain the primary evidence.
revision: partial

Referee: [abstract and results reporting] Pre-registered EHR targets: all beta models fall below the stated targets (0.75 for Te, 0.65 for Hi/Ta), yet the abstract and title frame the work as closing the gap. This discrepancy requires explicit discussion of why the targets were missed and how the reported gains should be interpreted relative to the original goals.

Authors: We accept that the current abstract and title language risks overstating the outcome relative to the pre-registered targets. We will revise the abstract to emphasize relative gains (17× over open-source SOTA, 3× over Deepgram on Telugu) while explicitly noting that absolute pre-registered targets were not met. A new paragraph in the Results or Discussion section will explain the shortfall (primarily the difficulty of entity-dense code-mixed speech, residual TTS artifacts, and the ambitious nature of the 0.75/0.65 targets) and will interpret the achieved EHR values in that context. The honest reporting of missed targets will be retained and highlighted.
revision: yes

Referee: [test set and EDSA corpus description] Held-out test construction and synthesis fidelity: main results (including the 17x/3x gains and EDSA ablation) are measured on synthetic data generated by the same TTS pipeline used for training. Without detailed verification that the held-out set is independent and that synthesis fidelity matches real acoustic/linguistic properties, the headline improvements risk over-optimism.

Authors: The held-out set uses disjoint entity combinations and synthesis seeds from the training data, but we agree that the manuscript currently provides insufficient detail on independence and fidelity verification. In revision we will add (1) a methods subsection describing the exact procedure for generating and partitioning the held-out utterances, (2) any available quantitative checks (e.g., TTS quality metrics or manual inspection summaries), and (3) an expanded discussion of synthetic-to-real distribution shift, cross-referencing the native sanity check and the EDSA ablation as mitigating evidence. These additions will make the evaluation boundaries clearer without altering the reported numbers.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML pipeline with external baselines and released artifacts

The paper describes a standard empirical workflow: an open-source TTS pipeline generates a synthetic EDSA corpus of ~22k utterances, a LoRA fine-tune is applied to a base ASR model (vasista22), and performance is measured via EHR on a held-out synthetic test set plus a small native sanity check (n=20). No equations, first-principles derivations, or self-referential definitions appear; the EDSA-isolation ablation simply compares training on FLEURS-Te alone versus the new corpus, which is ordinary experimental control rather than a fitted parameter renamed as prediction. All headline metrics are compared against external open-source and commercial baselines, with code, data, and holdouts released. The small native check is a genuine limitation on transfer claims but does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions in machine learning for ASR fine-tuning and the fidelity of TTS synthesis, with no new free parameters explicitly fitted beyond the training process.

axioms (2)

domain assumption: LoRA fine-tuning on Whisper models can adapt to domain-specific data without catastrophic forgetting of general capabilities.
Assumed in the fine-tuning step and in the bounded regression on FLEURS-Te.
domain assumption: Synthetic data from TTS can approximate real speech distributions for entity recognition tasks.
Central to the flywheel approach and to the transfer to native speech.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] V. S. Lodagala, “Whisper Telugu / Tamil / Hindi Large-v2: Whisper fine-tunes for Indic languages,” https://huggingface.co/vasista22/whisper-telugu-large-v2, 2023, released as part of the Whisper Fine-tuning Sprint; code at https://github.com/vasistalodagala/whisper-finetune. No associated peer-reviewed paper.

[2] K. S. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse benchmarks and training sets for Indian language ASR,” in Proc. Interspeech 2023, 2023, pp. 4384–4388.

[3] AI4Bharat, “IndicConformer-600M-Multilingual: Conformer-based ASR for 22 Indian languages,” https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual, 2024, model release; no associated peer-reviewed paper as of 2026-05-02.

[4] AI4Bharat, “IndicWhisper: Whisper fine-tunes for Indian languages,” https://github.com/AI4Bharat/vistaar, 2023, released alongside Vistaar (Bhogale et al., Interspeech 2023).

[5] J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Comp...

[6] S. Gandhi, P. von Platen, and A. M. Rush, “Distil-Whisper: Robust knowledge distillation via large-scale pseudo-labeling,” 2023.

[7] H. Rahman, “Script collapse in multilingual ASR: Defining and measuring script fidelity rate,” https://arxiv.org/abs/2604.08786, 2026, author and title verified from arXiv abs page on 2026-05-02.

[8] V. P. T. Menta, “Praxy voice: An open-source cross-script voice-cloning TTS for Indic languages,” 2026.

[9] V. P. T. Menta, “PSP: Phoneme substitution profile for automatic accent evaluation in indic TTS,” 2026.

[10] V. P. T. Menta, “LASE: Language-adversarial speaker encoding for indic cross-script identity preservation,” https://arxiv.org/abs/2605.00777, 2026, code + weights at https://github.com/praxelhq/lase and https://huggingface.co/Praxel/lase-r1.

[11] T. Javed, J. A. Nawale, E. I. George, S. Joshi, K. S. Bhogale, D. Mehendale, I. V. Sethi, A. Ananthanarayanan, H. Faquih, P. Palit, S. Ravishankar, S. Sukumaran, T. Panchagnula, S. Murali, K. S. Gandhi, A. R, M. K. K, C. V. Vaijayanthi, K. S. R. Karunganni, P. Kumar, and M. M. Khapra, “IndicVoices: Towards building an inclusive multilingual speech dat...

[12] Mozilla Foundation, “Common Voice corpus 25.0,” https://commonvoice.mozilla.org/en/datasets, 2025, accessed 2026-05-02; CV 25.0 release dated 2025-09-15.

[13] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: Few-shot learning evaluation of universal representations of speech,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2022, pp. 798–805.

[14] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022, we use the v3 checkpoint released in 2023 via openai/whisper-large-v3.
