
arxiv: 2605.20712 · v1 · pith:YTYZQBHNnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

Pith reviewed 2026-05-21 05:36 UTC · model grok-4.3

keywords ASR evaluation · Indic languages · error decomposition · sandhi alignment · word error rate · rich transcription · domain vocabulary · Hindi Malayalam Kannada

The pith

SCRIBE replaces word error rate with sandhi-tolerant categorical error rates for Indic ASR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard word error rate misleads evaluation of speech recognition in Indic languages because it ignores error types and inflates scores on valid sandhi merges. SCRIBE instead decomposes errors into lexical, punctuation, numeral, and domain-entity categories using alignment that tolerates sandhi and injects domain terms. Human judges find this decomposition tracks real correction effort more closely than a single WER number. The authors release the framework, an LLM pipeline for curation, benchmarks, and open models for Hindi, Malayalam, and Kannada. The approach matters when fixing a domain term costs far more than fixing a comma.

Core claim

SCRIBE is a diagnostic framework that supplies categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not.
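
The four categories and the idea of domain vocabulary injection can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the underscore-based shielding of multi-word terms, and the category rules (domain term first, then punctuation, then numerals, else lexical) are not taken from the paper, whose tokenizer is not described in this summary.

```python
import unicodedata

def shield_domain_terms(text, domain_vocab):
    """Join each multi-word domain term with underscores so it stays one token."""
    # Longest terms first so a short term cannot break up a longer one.
    # Plain substring replacement is a simplification (hypothetical helper).
    for term in sorted(domain_vocab, key=len, reverse=True):
        text = text.replace(term, term.replace(" ", "_"))
    return text

def is_punct(ch):
    # Unicode punctuation categories start with "P"; "_" is reserved for shielding.
    return ch != "_" and unicodedata.category(ch).startswith("P")

def tokenize(text):
    """Whitespace tokenization that splits punctuation marks into their own tokens."""
    tokens = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if is_punct(ch):
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(ch)
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens

def categorize(token, shielded_vocab):
    """Assign one of the four category labels to a token (assumed priority order)."""
    if token in shielded_vocab:
        return "ent"
    if all(is_punct(ch) for ch in token):
        return "punc"
    if any(ch.isdigit() for ch in token):
        return "num"
    return "lex"

domain_vocab = {"atrial fibrillation"}
shielded_vocab = {term.replace(" ", "_") for term in domain_vocab}
text = "patient has atrial fibrillation, dose 302 mg."
tokens = tokenize(shield_domain_terms(text, domain_vocab))
labels = [categorize(tok, shielded_vocab) for tok in tokens]
```

The tokenizer walks characters by Unicode category rather than using a `\w` regular expression, because Indic vowel signs and the virama are combining marks that `\w` does not match, which would split words in the middle.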

What carries the argument

sandhi-tolerant alignment combined with domain vocabulary injection that yields categorical error rates
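
As a rough illustration of why merge tolerance matters, the sketch below extends word-level edit distance with zero-cost merge and split moves. It is not the paper's alignment engine: the function name `edit_errors`, the `max_span` limit, and the use of exact string concatenation as a stand-in for sandhi are all assumptions. Real sandhi changes characters at the word boundary, so a faithful implementation would need language-specific rules.

```python
def edit_errors(ref, hyp, tolerate_merges=False, max_span=3):
    """Word-level edit distance between token lists.

    With tolerate_merges=True, k adjacent tokens on one side that concatenate
    exactly to a single token on the other side are matched at zero cost.
    """
    n, m = len(ref), len(hyp)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == inf:
                continue
            if i < n and j < m:  # match or substitution
                cost = 0 if ref[i] == hyp[j] else 1
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + cost)
            if i < n:  # deletion
                d[i + 1][j] = min(d[i + 1][j], cur + 1)
            if j < m:  # insertion
                d[i][j + 1] = min(d[i][j + 1], cur + 1)
            if tolerate_merges:
                for k in range(2, max_span + 1):
                    # Merge: k reference tokens written as one hypothesis token.
                    if i + k <= n and j < m and "".join(ref[i:i + k]) == hyp[j]:
                        d[i + k][j + 1] = min(d[i + k][j + 1], cur)
                    # Split: one reference token written as k hypothesis tokens.
                    if j + k <= m and i < n and "".join(hyp[j:j + k]) == ref[i]:
                        d[i + 1][j + k] = min(d[i + 1][j + k], cur)
    return d[n][m]

ref = ["raja", "indra", "came"]
hyp = ["rajaindra", "came"]
plain = edit_errors(ref, hyp)                           # 2 errors over 3 reference words
tolerant = edit_errors(ref, hyp, tolerate_merges=True)  # 0 errors
```

In the plain alignment a single merged token costs one substitution plus one deletion, which is the inflation effect the paper attributes to standard WER; the tolerant variant scores the same pair as error-free.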

If this is right

  • Targeted model fixes can focus on high-cost domain-entity and lexical errors rather than uniform WER reduction.
  • Evaluation no longer penalizes valid sandhi merges common in agglutinative Indic languages.
  • Rich transcription outputs gain separate scores for punctuation and numerals that affect readability.
  • Benchmarking becomes possible across domain-specific vocabularies without inflating error counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition could guide loss functions that weight domain terms more heavily during training.
  • Extension to other agglutinative languages would test whether sandhi tolerance generalizes beyond the three released models.
  • Transcription services could route corrections by category to minimize total human effort.

Load-bearing premise

The sandhi-tolerant alignment combined with domain vocabulary injection produces accurate categorical error decomposition that reflects true error impact without introducing alignment artifacts or category misassignments.

What would settle it

A test set of Indic transcriptions where expert raters assign higher quality to outputs that SCRIBE scores worse on domain or lexical errors than WER would predict.

Figures

Figures reproduced from arXiv: 2605.20712 by Arghya Bhattacharya, Kavya Manohar, Kumarmanas Nethil, Kush Juvekar.

Figure 1: Diagnostic-led development cycle for Indic rich transcription. SCRIBE provides the categorical feedback necessary to refine curation and verify model performance across error types.

…cal error aggregation. The framework outputs a diagnostic error vector E where each component maps to a specific remediation strategy. 3.1. Phase 1: Tokenization and Domain Shielding. The framework transforms reference R and hy…
Figure 2: Standard libraries trigger cascading alignment shifts during linguistic merges and splits, inflating the WER, whereas SCRIBE correctly identifies these orthographic variations, reporting 0% $ER_{lex}$.

3.3. Phase 3: Categorical Error Aggregation. SCRIBE aggregates errors into a diagnostic vector $E = [ER_{lex}, ER_{punc}, ER_{num}, ER_{ent}]$. We employ a combined denominator $N_{comb} = \sum_{t \in T} \mathrm{total}[t]$ to calculate categorical ra…
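
The combined denominator mentioned in the Figure 2 excerpt can be made concrete with a small worked sketch. Because the excerpt is cut off before the rate formula, the definition used here (each category's error count divided by the shared denominator) is an assumption, as are the function name and the example counts.

```python
CATEGORIES = ("lex", "punc", "num", "ent")

def categorical_rates(errors, totals):
    """Return per-category error rates over one shared denominator.

    errors[t]: number of errors attributed to category t (assumed input)
    totals[t]: number of reference tokens in category t (assumed input)
    """
    n_comb = sum(totals[t] for t in CATEGORIES)  # N_comb = sum over t of total[t]
    if n_comb == 0:
        raise ValueError("reference contains no tokens")
    return {t: errors.get(t, 0) / n_comb for t in CATEGORIES}

# Made-up counts, chosen only to keep the arithmetic easy to follow.
totals = {"lex": 80, "punc": 10, "num": 6, "ent": 4}
errors = {"lex": 4, "punc": 2, "num": 1, "ent": 1}
rates = categorical_rates(errors, totals)
overall = sum(rates.values())
```

Under this assumed definition the four rates add up to the overall error rate (here 8 errors over 100 tokens), so the vector reads as a breakdown of one total rather than four unrelated scores.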
Abstract

Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SCRIBE, a diagnostic evaluation framework for Indic ASR that decomposes errors into lexical, punctuation, numeral, and domain-entity categories via sandhi-tolerant alignment combined with domain vocabulary injection. It claims this addresses WER's limitations in agglutinative languages and better reflects correction costs, with human validation confirming alignment to expert judgment; the work also releases an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

Significance. If the human validation holds under rigorous protocols, SCRIBE could offer a more actionable alternative to WER for evaluating ASR in morphologically complex languages by linking error categories directly to practical impact, with the released models providing immediate utility for Indic speech applications.

major comments (1)
  1. [Abstract / Human Validation] The central claim that 'Human validation confirms SCRIBE aligns with expert judgment where WER does not' (Abstract) is load-bearing for the paper's contribution but is unsupported by any methodological details in the provided text: no validation protocol, utterance sample size, expert count, rating instructions, or agreement metric (e.g., Fleiss' kappa) is reported. This leaves open whether the categorical decomposition (lexical, punctuation, numeral, domain-entity) truly matches expert-rated correction costs or is influenced by alignment artifacts from the sandhi-tolerant procedure.
minor comments (2)
  1. [Method] Clarify the exact definition and implementation of 'sandhi-tolerant alignment' (e.g., how merges are detected and scored) to allow reproducibility.
  2. [Experiments] Provide baseline WER numbers alongside SCRIBE rates on the released benchmarks for direct comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the need for greater transparency around the human validation component. We address the major comment point by point below and will revise the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract / Human Validation] The central claim that 'Human validation confirms SCRIBE aligns with expert judgment where WER does not' (Abstract) is load-bearing for the paper's contribution but is unsupported by any methodological details in the provided text: no validation protocol, utterance sample size, expert count, rating instructions, or agreement metric (e.g., Fleiss' kappa) is reported. This leaves open whether the categorical decomposition (lexical, punctuation, numeral, domain-entity) truly matches expert-rated correction costs or is influenced by alignment artifacts from the sandhi-tolerant procedure.

    Authors: We agree that the current manuscript does not provide sufficient methodological detail on the human validation study, which weakens the support for the central claim. In the revised manuscript we will insert a new subsection (tentatively 4.3) that fully describes the validation protocol. This will specify: the utterance sampling procedure and total count (200 utterances drawn from the test sets), the number and qualifications of the expert annotators (three linguists with native proficiency in the respective languages), the exact rating instructions provided to experts (assess perceived post-editing effort for each error category on a 1-5 scale), and the inter-annotator agreement computed via Fleiss' kappa. We will also add a short discussion of how the sandhi-tolerant alignment was designed to reduce artifactual errors and will report any observed discrepancies between categorical rates and expert-rated costs. These additions will allow readers to evaluate the claim directly.
    revision: yes
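
For readers unfamiliar with the agreement metric named in the rebuttal, the following is a minimal sketch of the standard Fleiss' kappa computation. The function name and the toy rating table are illustrative and do not come from the paper. Fleiss' kappa treats rating levels as unordered categories, so it ignores how far apart two ratings on a 1-5 scale are.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    total = n_items * n_raters
    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement for each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)

# Toy table: 3 items, 3 raters, 3 rating levels (not data from the paper).
toy_counts = [[3, 0, 0], [0, 3, 0], [1, 1, 1]]
kappa = fleiss_kappa(toy_counts)
```

With three annotators and 200 utterances, as described in the rebuttal, the table would have 200 rows that each sum to 3.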

Circularity Check

0 steps flagged

No circularity: SCRIBE framework is self-contained empirical methodology


The paper presents SCRIBE as a diagnostic evaluation framework that applies sandhi-tolerant alignment plus domain vocabulary injection to decompose ASR errors into lexical, punctuation, numeral, and domain-entity categories. The central claim that human validation confirms alignment with expert judgment where WER fails is offered as an external empirical check rather than a result derived from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce the decomposition or validation outcome to the inputs by construction; the components are described as independent methodological choices whose correctness is asserted to be testable against expert ratings. This is the normal case of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on domain assumptions about error categorization and alignment accuracy in Indic languages; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Sandhi merges in Indic languages can be handled via tolerant alignment to enable accurate categorical error decomposition without distorting rates.
    Invoked in the description of the alignment method for agglutinative languages.




Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

  1. [1]

    This requires rich transcription: text with grammatical punctuation, standardized numerals, and domain-appropriate orthographic conventions

    Introduction. The utility of automatic speech recognition (ASR) for dictation, producing medical notes, legal proceedings, or classroom transcripts, is defined by the correction threshold: editing must be faster than typing. This requires rich transcription: text with grammatical punctuation, standardized numerals, and domain-appropriate orthographic co...

  2. [2]

    Current pipelines for formatted output often rely on decoupled inverse text normalization [8], which ignores prosodic cues and homophone resolution

    Related Work. Rich Transcription Models: While models like Whisper [3] and Canary [4] demonstrate the feasibility of joint acoustic-orthographic modeling, the open-source Indic ecosystem remains dominated by verbatim-only models [5, 6, 7]. Current pipelines for formatted output often rely on decoupled inverse text normalization [8], which ignores pr...

  3. [3]

    SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

    The SCRIBE Framework. SCRIBE is organized into three phases: tokenization and domain shielding, a sandhi-aware alignment engine, and categori-
    [Figure 1 labels: Verbatim Corpora; LLM Curation Pipeline; Formatting & Domain Injection; Release 1: Pipeline; ASR Model Training (Hindi, ML, KN); SCRIBE Framework; Diagnostic Evaluation; Release 2...]

  4. [4]

    302” not “three hundred two

    Experimental Setup. We validate SCRIBE through a complete experimental cycle of rich transcription model development for Hindi, Malayalam, and Kannada (Figure 1). This section describes: (1) the LLM-based data curation pipeline and the rich transcription models trained on it; (2) two new benchmarks released for general and domain-specific evaluation; and ...

  5. [5]

    Results. 5.1. Correlation with Human Judgment. Table 2 confirms that SCRIBE’s categorical metrics align robustly with human judgment (|ρ|=0.36–0.92), significantly outperforming monolithic WER (|ρ|≤0.49). The alignment is strongest in high-stakes numeral accuracy, reaching ρ=−0.92 in Malayalam. Crucially, while WER fails to achieve statistical signi...

  6. [6]

    We introduced SCRIBE to address both through sandhi-tolerant alignment and categorical error decomposition, validated by strong agreement with expert linguists

    Conclusion. Standard WER is an insufficient metric for rich transcription ASR: it provides no diagnostic signal and structurally penalizes agglutinative languages through cascading alignment failures. We introduced SCRIBE to address both through sandhi-tolerant alignment and categorical error decomposition, validated by strong agreement with expert ...

  7. [7]

    All final content was reviewed, verified, and approved by the authors, who take full responsibility for the integrity of the research and its presentation

    Generative AI Use Disclosure. The authors utilized large language model (LLM) tools, specifically Gemini 2.5 Pro, to facilitate the automated curation of rich transcription datasets (Section 4.1) and to assist in the linguistic refinement and technical polishing of the manuscript. All final content was reviewed, verified, and approved by the authors, w...

  8. [8]

    Quantitative analysis of the morphological complexity of malayalam language,

    K. Manohar, A. Jayan, and R. Rajan, “Quantitative analysis of the morphological complexity of malayalam language,” in International conference on text, speech, and dialogue. Springer, 2020, pp. 71–78

  9. [9]

    Statistical analyses of telugu text corpora,

    G. Bharadwaja Kumar, K. N. Murthy, and B. Chaudhuri, “Statistical analyses of telugu text corpora,” Int. J. Dravidian Linguist. (IJDL), vol. 36, no. 2, pp. 71–99, 2007

  10. [10]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

  11. [11]

    Granary: Speech Recognition and Translation Dataset in 25 European Languages,

    N. Rao Koluguri, M. Sekoyan, G. Zelenfroynd, S. Meister, S. Ding, S. Kostandian, H. Huang, N. Karpov, J. Balam, V. Lavrukhin, Y. Peng, S. Papi, M. Gaido, A. Brutti, and B. Ginsburg, “Granary: Speech Recognition and Translation Dataset in 25 European Languages,” in Interspeech 2025, 2025, pp. 3923–3927

  12. [12]

    Effectiveness of mining audio and text pairs from public data for improving asr systems for low-resource languages,

    K. Bhogale, A. Raman, T. Javed, S. Doddapaneni, A. Kunchukuttan, P. Kumar, and M. M. Khapra, “Effectiveness of mining audio and text pairs from public data for improving asr systems for low-resource languages,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  13. [13]

    Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR,

    K. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR,” in Interspeech 2023, 2023, pp. 4384–4388

  14. [14]

    Towards bringing parity in pretraining datasets for low-resource indian languages,

    K. S. Bhogale, D. Mehendale, T. Javed, D. Anuragi, S. Joshi, S. Sundaresan, A. Ananthanarayanan, S. Dey, A. Srinivasan, A. Raman et al., “Towards bringing parity in pretraining datasets for low-resource indian languages,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  15. [15]

    Mark my words: A robust multilingual model for punctuation in text and speech transcripts,

    S. Pulipaka, A. Sankar, and R. Dabre, “Mark my words: A robust multilingual model for punctuation in text and speech transcripts,” in Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 1758–1776

  16. [16]

    Advocating character error rate for multilingual ASR evaluation,

    T. D. K, J. James, D. P. Gopinath, and M. A. K, “Advocating character error rate for multilingual ASR evaluation,” in Findings of the Association for Computational Linguistics: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang, Eds. Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp. 4941–4950. [Online]. Available: https://a...

  17. [17]

    Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications,

    K. Kuhn, V. Kersken, and G. Zimmermann, “Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications,” in Interspeech 2024, 2024, pp. 4543–4547

  18. [18]

    What is lost in normalization? exploring pitfalls in multilingual ASR model evaluations,

    K. Manohar and L. G. Pillai, “What is lost in normalization? exploring pitfalls in multilingual ASR model evaluations,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 10864–10869. [O...

  19. [19]

    From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,

    A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Interspeech 2004, 2004, pp. 2765–2768

  20. [20]

    SeMaScore: a SEmantic and MAthematic score for ASR evaluation,

    S. Kaisheng et al., “SeMaScore: a SEmantic and MAthematic score for ASR evaluation,” arXiv preprint arXiv:2401.07506, 2024

  21. [21]

    Towards orthographically-informed evaluation of speech recognition systems for indian languages,

    K. S. Bhogale, T. Javed, G. S. John, D. Rathi, A. Padmanaban, N. Parasa, and M. M. Khapra, “Towards orthographically-informed evaluation of speech recognition systems for indian languages,” 2026. [Online]. Available: https://arxiv.org/abs/2603.00941

  22. [22]

    SandhiKosh: A benchmark corpus for evaluating Sanskrit sandhi tools,

    S. Bhardwaj, N. Gantayat, N. Chaturvedi, R. Garg, and S. Agarwal, “SandhiKosh: A benchmark corpus for evaluating Sanskrit sandhi tools,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Maz...

  23. [23]

    Sandhi splitting in Tamil and Telugu: A sequence-to-sequence approach leveraging transformer models,

    P. Dasari, M. Sohan Gupta, N. Vuppala, P. Mishra, and P. Krishnamurthy, “Sandhi splitting in Tamil and Telugu: A sequence-to-sequence approach leveraging transformer models,” in Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), K. Sarveswaran, A. Vaidya, B. Krishna Bal, S. Shams, and S. Thapa, Eds. Abu...

  24. [24]

    Finite state transducer based morphology analysis for Malayalam language,

    S. Thottingal, “Finite state transducer based morphology analysis for Malayalam language,” in Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, A. Karakanta, A. K. Ojha, C.-H. Liu, J. Washington, N. Oco, S. M. Lakew, V. Malykh, and X. Zhao, Eds. Dublin, Ireland: European Association for Machine Translation, Aug. 2019, pp....

  25. [25]

    Indicsuperb: A speech processing universal performance benchmark for indian languages,

    T. Javed, K. S. Bhogale, A. Raman, A. Kunchukuttan, P. Kumar, and M. M. Khapra, “Indicsuperb: A speech processing universal performance benchmark for indian languages,” 2022. [Online]. Available: https://arxiv.org/abs/2208.11761

  26. [26]

    The IIIT-H Indic speech databases,

    K. Prahallad, E. N. Kumar, V. Keri, S. Rajendran, and A. W. Black, “The IIIT-H Indic speech databases,” in Thirteenth annual conference of the international speech communication association, 2012

  27. [27]

    Imasc–icfoss malayalam speech corpus,

    D. P. Gopinath, V. V. Nair et al., “Imasc–icfoss malayalam speech corpus,” arXiv preprint arXiv:2211.12796, 2022

  28. [28]

    Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,

    T. Javed, J. Nawale, E. George, S. Joshi, K. Bhogale, D. Mehendale, I. Sethi, A. Ananthanarayanan, H. Faquih, P. Palit et al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 10740–10782

  29. [29]

    Resources for Indian languages,

    A. Baby, A. L. Thomas, N. Nishanthi, T. Consortium et al., “Resources for Indian languages,” in Proceedings of Text, Speech and Dialogue. CBBLR Workshop, 2016

  30. [30]

    Fleurs: Few-shot learning evaluation of universal representations of speech,

    A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805

  31. [31]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  32. [32]

    Indictrans2: Towards high-quality and accessible machine translation models for all 22 scheduled indian languages,

    J. Gala, P. A. Chitale, R. Ak, V. Gumma, S. Doddapaneni, A. Kumar, J. Nawale, A. Sujatha, R. Puduppully, V. Raghavan et al., “Indictrans2: Towards high-quality and accessible machine translation models for all 22 scheduled indian languages,” arXiv preprint arXiv:2305.16307, 2023