SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR
Pith reviewed 2026-05-21 05:36 UTC · model grok-4.3
The pith
SCRIBE replaces word error rate with sandhi-tolerant categorical error rates for Indic ASR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCRIBE is a diagnostic framework that supplies categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not.
What carries the argument
Sandhi-tolerant alignment combined with domain vocabulary injection, which together yield the categorical error rates.
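To make the idea of a categorical decomposition concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the category rules, and the assumption that reference and hypothesis tokens arrive already aligned are all illustrative choices made here. Insertions are left out for brevity.

```python
import unicodedata
from collections import Counter

def categorize(token, domain_vocab):
    """Assign a reference token to one of four illustrative categories."""
    if token in domain_vocab:
        return "domain_entity"
    # Digits with optional thousands/decimal separators, e.g. "1,500" or "2.5".
    if token.replace(",", "").replace(".", "").isdigit():
        return "numeral"
    # Tokens made only of Unicode punctuation, e.g. "," or the danda.
    if all(unicodedata.category(ch).startswith("P") for ch in token):
        return "punctuation"
    return "lexical"

def categorical_error_rates(aligned_pairs, domain_vocab):
    """aligned_pairs: (ref_token, hyp_token) tuples; hyp_token is None for a deletion.

    Returns errors / reference tokens for each category that occurs.
    """
    totals, errors = Counter(), Counter()
    for ref_tok, hyp_tok in aligned_pairs:
        cat = categorize(ref_tok, domain_vocab)
        totals[cat] += 1
        if hyp_tok != ref_tok:
            errors[cat] += 1
    return {cat: errors[cat] / totals[cat] for cat in totals}

# Hypothetical aligned utterance with one misspelled drug name and one dropped comma.
pairs = [("patient", "patient"), ("paracetamol", "parcetamol"), ("500", "500"),
         ("mg", "mg"), (",", None), ("daily", "daily")]
rates = categorical_error_rates(pairs, domain_vocab={"paracetamol", "mg"})
print(rates)
```

In this toy case a single scalar would report 2 errors over 6 reference tokens, while the decomposition shows that one error hit a domain term and the other a punctuation mark.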
If this is right
- Targeted model fixes can focus on high-cost domain-entity and lexical errors rather than uniform WER reduction.
- Evaluation no longer penalizes valid sandhi merges common in agglutinative Indic languages.
- Rich transcription outputs gain separate scores for punctuation and numerals that affect readability.
- Benchmarking becomes possible across domain-specific vocabularies without inflating error counts.
Where Pith is reading between the lines
- The same decomposition could guide loss functions that weight domain terms more heavily during training.
- Extension to other agglutinative languages would test whether sandhi tolerance generalizes beyond the three languages covered by the released models (Hindi, Malayalam, and Kannada).
- Transcription services could route corrections by category to minimize total human effort.
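The routing idea in the last point rests on simple arithmetic: total correction effort is the sum of per-category error counts weighted by per-category fix costs. The sketch below illustrates this under stated assumptions. The cost values (seconds per fix) and the typing budget are hypothetical numbers chosen for the example, not figures from the paper.

```python
def correction_cost(error_counts, cost_per_error):
    """Total editing effort: sum of count * unit cost over error categories."""
    return sum(n * cost_per_error[cat] for cat, n in error_counts.items())

def asr_worth_using(error_counts, cost_per_error, typing_cost):
    """ASR pays off only if correcting its output is cheaper than typing from scratch."""
    return correction_cost(error_counts, cost_per_error) < typing_cost

# Hypothetical unit costs in seconds per fix.
costs = {"punctuation": 2.0, "domain_entity": 15.0}

# Two transcripts with the same number of errors but different error types.
transcript_a = {"punctuation": 4, "domain_entity": 0}
transcript_b = {"punctuation": 0, "domain_entity": 4}

for name, counts in (("A", transcript_a), ("B", transcript_b)):
    print(name, correction_cost(counts, costs), asr_worth_using(counts, costs, typing_cost=30.0))
```

Both transcripts have four errors, so a count-based metric treats them alike, yet under these assumed costs only transcript A stays below the typing budget.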
Load-bearing premise
The sandhi-tolerant alignment combined with domain vocabulary injection produces accurate categorical error decomposition that reflects true error impact without introducing alignment artifacts or category misassignments.
What would settle it
A test set of Indic transcriptions in which expert raters assign higher quality to outputs that SCRIBE scores worse on domain-entity or lexical errors, that is, cases where WER predicts expert preference better than SCRIBE's categorical rates do.
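One way to run such a check is to compare how well each metric ranks outputs against expert ratings, for example with Spearman's rank correlation. That choice is an assumption made here; the excerpted results report ρ values without naming the statistic. The sketch below is a plain-Python version with average ranks for ties, and the scores in the usage example are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-utterance error rates and expert quality ratings (1-5).
error_rate = [0.10, 0.20, 0.30, 0.40]
expert_quality = [5, 4, 3, 2]
print(spearman_rho(error_rate, expert_quality))
```

An error metric that tracks expert judgment should correlate negatively with quality ratings; the falsifying case described above would show the categorical rates with a weaker (or wrongly signed) correlation than WER on the same utterances.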
Original abstract
Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
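The second failure mode, valid merges being counted as errors, can be illustrated with a small edit-distance sketch. This is not the paper's alignment engine. It assumes, as a simplification, that a merge is an exact concatenation of two adjacent tokens, whereas real sandhi usually changes characters at the boundary. The function name and the romanized tokens are made up for the example.

```python
def edit_distance(ref, hyp, allow_merges=False):
    """Token-level Levenshtein distance.

    With allow_merges=True, a hypothesis token that equals two adjacent
    reference tokens joined together (or the reverse split) costs nothing.
    """
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all reference tokens
    for j in range(1, m + 1):
        d[0][j] = j  # insert all hypothesis tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            best = min(d[i - 1][j - 1] + sub,  # match or substitution
                       d[i - 1][j] + 1,        # deletion
                       d[i][j - 1] + 1)        # insertion
            if allow_merges:
                # Two reference tokens written as one hypothesis token.
                if i >= 2 and ref[i - 2] + ref[i - 1] == hyp[j - 1]:
                    best = min(best, d[i - 2][j - 1])
                # One reference token written as two hypothesis tokens.
                if j >= 2 and hyp[j - 2] + hyp[j - 1] == ref[i - 1]:
                    best = min(best, d[i - 1][j - 2])
            d[i][j] = best
    return d[n][m]

ref = ["raja", "kumaran", "vannu"]
hyp = ["rajakumaran", "vannu"]
print(edit_distance(ref, hyp) / len(ref))                     # plain WER
print(edit_distance(ref, hyp, allow_merges=True) / len(ref))  # merge-tolerant
```

Plain alignment charges one substitution plus one deletion for the merged token, while the merge-tolerant variant charges nothing. A production system would need a boundary-aware equivalence test in place of the exact string concatenation used here.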
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SCRIBE, a diagnostic evaluation framework for Indic ASR that decomposes errors into lexical, punctuation, numeral, and domain-entity categories via sandhi-tolerant alignment combined with domain vocabulary injection. It claims this addresses WER's limitations in agglutinative languages and better reflects correction costs, with human validation confirming alignment to expert judgment; the work also releases an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
Significance. If the human validation holds under rigorous protocols, SCRIBE could offer a more actionable alternative to WER for evaluating ASR in morphologically complex languages by linking error categories directly to practical impact, with the released models providing immediate utility for Indic speech applications.
major comments (1)
- [Abstract / Human Validation] The central claim that 'Human validation confirms SCRIBE aligns with expert judgment where WER does not' (Abstract) is load-bearing for the paper's contribution but is unsupported by any methodological details in the provided text: no validation protocol, utterance sample size, expert count, rating instructions, or agreement metric (e.g., Fleiss' kappa) is reported. This leaves open whether the categorical decomposition (lexical, punctuation, numeral, domain-entity) truly matches expert-rated correction costs or is influenced by alignment artifacts from the sandhi-tolerant procedure.
minor comments (2)
- [Method] Clarify the exact definition and implementation of 'sandhi-tolerant alignment' (e.g., how merges are detected and scored) to allow reproducibility.
- [Experiments] Provide baseline WER numbers alongside SCRIBE rates on the released benchmarks for direct comparison.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for highlighting the need for greater transparency around the human validation component. We address the major comment point by point below and will revise the manuscript accordingly to strengthen the presentation of our results.
- Referee: [Abstract / Human Validation] The central claim that 'Human validation confirms SCRIBE aligns with expert judgment where WER does not' (Abstract) is load-bearing for the paper's contribution but is unsupported by any methodological details in the provided text: no validation protocol, utterance sample size, expert count, rating instructions, or agreement metric (e.g., Fleiss' kappa) is reported. This leaves open whether the categorical decomposition (lexical, punctuation, numeral, domain-entity) truly matches expert-rated correction costs or is influenced by alignment artifacts from the sandhi-tolerant procedure.
  Authors: We agree that the current manuscript does not provide sufficient methodological detail on the human validation study, which weakens the support for the central claim. In the revised manuscript we will insert a new subsection (tentatively 4.3) that fully describes the validation protocol. This will specify: the utterance sampling procedure and total count (200 utterances drawn from the test sets), the number and qualifications of the expert annotators (three linguists with native proficiency in the respective languages), the exact rating instructions provided to experts (assess perceived post-editing effort for each error category on a 1-5 scale), and the inter-annotator agreement computed via Fleiss' kappa. We will also add a short discussion of how the sandhi-tolerant alignment was designed to reduce artifactual errors and will report any observed discrepancies between categorical rates and expert-rated costs. These additions will allow readers to evaluate the claim directly.
  Revision: yes
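For readers unfamiliar with the agreement statistic named in the rebuttal, the following is a minimal sketch of Fleiss' kappa in plain Python. The rating table in the usage example is invented, and the function is a textbook formulation rather than anything taken from the paper. Note that Fleiss' kappa treats the rating levels as unordered categories, so on a 1-5 scale it does not credit near-misses between adjacent levels.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][k] is the number of raters
    who assigned item i to category k. Every row must sum to the same
    number of raters (at least 2). Undefined if all ratings fall in one
    category, because chance agreement is then 1.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement per item: fraction of rater pairs that agree.
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(row[k] for row in counts) / (n_items * n_raters)
              for k in range(n_cats)]
    p_chance = sum(s * s for s in shares)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical table: 4 utterances, 3 raters, 2 rating categories.
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(table))
```

With three annotators, as the rebuttal describes, each row of the table would sum to 3 and have five columns, one per point on the rating scale.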
Circularity Check
No circularity: the SCRIBE framework is a self-contained empirical methodology
The paper presents SCRIBE as a diagnostic evaluation framework that applies sandhi-tolerant alignment plus domain vocabulary injection to decompose ASR errors into lexical, punctuation, numeral, and domain-entity categories. The central claim that human validation confirms alignment with expert judgment where WER fails is offered as an external empirical check rather than a result derived from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce the decomposition or validation outcome to the inputs by construction; the components are described as independent methodological choices whose correctness is asserted to be testable against expert ratings. This is the normal case of a non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sandhi merges in Indic languages can be handled via tolerant alignment to enable accurate categorical error decomposition without distorting rates.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Human validation confirms SCRIBE aligns with expert judgment where WER does not.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction. The utility of automatic speech recognition (ASR) for dictation, producing medical notes, legal proceedings, or classroom transcripts, is defined by the correction threshold: editing must be faster than typing. This requires rich transcription: text with grammatical punctuation, standardized numerals, and domain-appropriate orthographic co...
- [2] Related Work. Rich Transcription Models: While models like Whisper [3] and Canary [4] demonstrate the feasibility of joint acoustic-orthographic modeling, the open-source Indic ecosystem remains dominated by verbatim-only models [5, 6, 7]. Current pipelines for formatted output often rely on decoupled inverse text normalization [8], which ignores pr...
- [3] SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR. arXiv:2605.20712v1 [cs.CL], 20 May 2026. The SCRIBE Framework. SCRIBE is organized into three phases: tokenization and domain shielding, a sandhi-aware alignment engine, and categori- [text interrupted by Figure 1 labels: Verbatim Corpora; LLM Curation Pipeline; Formatting & Domain Injection; Release 1: Pipeline; ASR Model Training (Hindi, ML, KN); SCRIBE Framework; Diagnostic Evaluation; Release 2...]
- [4] Experimental Setup. We validate SCRIBE through a complete experimental cycle of rich transcription model development for Hindi, Malayalam, and Kannada (Figure 1). This section describes: (1) the LLM-based data curation pipeline and the rich transcription models trained on it; (2) two new benchmarks released for general and domain-specific evaluation; and ...
- [5] Results. 5.1. Correlation with Human Judgment. Table 2 confirms that SCRIBE’s categorical metrics align robustly with human judgment (|ρ| = 0.36–0.92), significantly outperforming monolithic WER (|ρ| ≤ 0.49). The alignment is strongest in high-stakes numeral accuracy, reaching ρ = −0.92 in Malayalam. Crucially, while WER fails to achieve statistical signi...
- [6] Conclusion. Standard WER is an insufficient metric for rich transcription ASR: it provides no diagnostic signal and structurally penalizes agglutinative languages through cascading alignment failures. We introduced SCRIBE to address both through sandhi-tolerant alignment and categorical error decomposition, validated by strong agreement with expert ...
- [7] Generative AI Use Disclosure. The authors utilized large language model (LLM) tools, specifically Gemini 2.5 Pro, to facilitate the automated curation of rich transcription datasets (Section 4.1) and to assist in the linguistic refinement and technical polishing of the manuscript. All final content was reviewed, verified, and approved by the authors, w...
- [8] K. Manohar, A. Jayan, and R. Rajan, “Quantitative analysis of the morphological complexity of malayalam language,” in International conference on text, speech, and dialogue. Springer, 2020, pp. 71–78
- [9] G. Bharadwaja Kumar, K. N. Murthy, and B. Chaudhuri, “Statistical analyses of telugu text corpora,” Int. J. Dravidian Linguist. (IJDL), vol. 36, no. 2, pp. 71–99, 2007
- [10] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518
- [11] N. Rao Koluguri, M. Sekoyan, G. Zelenfroynd, S. Meister, S. Ding, S. Kostandian, H. Huang, N. Karpov, J. Balam, V. Lavrukhin, Y. Peng, S. Papi, M. Gaido, A. Brutti, and B. Ginsburg, “Granary: Speech Recognition and Translation Dataset in 25 European Languages,” in Interspeech 2025, 2025, pp. 3923–3927
- [12] K. Bhogale, A. Raman, T. Javed, S. Doddapaneni, A. Kunchukuttan, P. Kumar, and M. M. Khapra, “Effectiveness of mining audio and text pairs from public data for improving asr systems for low-resource languages,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5
- [13] K. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR,” in Interspeech 2023, 2023, pp. 4384–4388
- [14] K. S. Bhogale, D. Mehendale, T. Javed, D. Anuragi, S. Joshi, S. Sundaresan, A. Ananthanarayanan, S. Dey, A. Srinivasan, A. Raman et al., “Towards bringing parity in pretraining datasets for low-resource indian languages,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
- [15] S. Pulipaka, A. Sankar, and R. Dabre, “Mark my words: A robust multilingual model for punctuation in text and speech transcripts,” in Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 1758–1776
- [16] T. D. K, J. James, D. P. Gopinath, and M. A. K, “Advocating character error rate for multilingual ASR evaluation,” in Findings of the Association for Computational Linguistics: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang, Eds. Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp. 4941–4950. [Online]. Available: https://a...
- [17] K. Kuhn, V. Kersken, and G. Zimmermann, “Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications,” in Interspeech 2024, 2024, pp. 4543–4547
- [18] K. Manohar and L. G. Pillai, “What is lost in normalization? exploring pitfalls in multilingual ASR model evaluations,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 10864–10869. [O...
- [19] A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Interspeech 2004, 2004, pp. 2765–2768
- [20] S. Kaisheng et al., “SeMaScore: a SEmantic and MAthematic score for ASR evaluation,” arXiv preprint arXiv:2401.07506, 2024
- [21] K. S. Bhogale, T. Javed, G. S. John, D. Rathi, A. Padmanaban, N. Parasa, and M. M. Khapra, “Towards orthographically-informed evaluation of speech recognition systems for indian languages,” 2026. [Online]. Available: https://arxiv.org/abs/2603.00941
- [22] S. Bhardwaj, N. Gantayat, N. Chaturvedi, R. Garg, and S. Agarwal, “SandhiKosh: A benchmark corpus for evaluating Sanskrit sandhi tools,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Maz...
- [23] P. Dasari, M. Sohan Gupta, N. Vuppala, P. Mishra, and P. Krishnamurthy, “Sandhi splitting in Tamil and Telugu: A sequence-to-sequence approach leveraging transformer models,” in Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), K. Sarveswaran, A. Vaidya, B. Krishna Bal, S. Shams, and S. Thapa, Eds. Abu...
- [24] S. Thottingal, “Finite state transducer based morphology analysis for Malayalam language,” in Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, A. Karakanta, A. K. Ojha, C.-H. Liu, J. Washington, N. Oco, S. M. Lakew, V. Malykh, and X. Zhao, Eds. Dublin, Ireland: European Association for Machine Translation, Aug. 2019, pp....
- [25] T. Javed, K. S. Bhogale, A. Raman, A. Kunchukuttan, P. Kumar, and M. M. Khapra, “Indicsuperb: A speech processing universal performance benchmark for indian languages,” 2022. [Online]. Available: https://arxiv.org/abs/2208.11761
- [26] K. Prahallad, E. N. Kumar, V. Keri, S. Rajendran, and A. W. Black, “The IIIT-H Indic speech databases,” in Thirteenth annual conference of the international speech communication association, 2012
- [27] D. P. Gopinath, V. V. Nair et al., “Imasc–icfoss malayalam speech corpus,” arXiv preprint arXiv:2211.12796, 2022
- [28] T. Javed, J. Nawale, E. George, S. Joshi, K. Bhogale, D. Mehendale, I. Sethi, A. Ananthanarayanan, H. Faquih, P. Palit et al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 10740–10782
- [29] A. Baby, A. L. Thomas, N. Nishanthi, T. Consortium et al., “Resources for Indian languages,” in Proceedings of Text, Speech and Dialogue. CBBLR Workshop, 2016
- [30] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805
- [31] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025
- [32] J. Gala, P. A. Chitale, R. Ak, V. Gumma, S. Doddapaneni, A. Kumar, J. Nawale, A. Sujatha, R. Puduppully, V. Raghavan et al., “Indictrans2: Towards high-quality and accessible machine translation models for all 22 scheduled indian languages,” arXiv preprint arXiv:2305.16307, 2023