pith. machine review for the scientific record.

arxiv: 2605.02928 · v1 · submitted 2026-04-26 · 💻 cs.SD · cs.AI

Recognition: unknown

Keyword spotting using convolutional neural network for speech recognition in Hindi

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:02 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords keyword spotting · convolutional neural network · Hindi speech · MFCC · speech recognition · on-device system · audio classification

The pith

A convolutional neural network identifies Hindi keywords from audio at 91.79 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how convolutional neural networks can be applied to spot specific keywords in spoken Hindi by first converting audio into Mel Frequency Cepstral Coefficients. The authors test several network designs on a collection of 40,000 audio clips, each about 1.9 seconds long, and report strong results while keeping the system light enough for phones or other devices. Such a method matters because it offers a way to build speech interfaces tailored to individual users in Hindi without needing massive computing power. Readers would care because it points to practical tools for voice-based interaction in a major world language.

Core claim

The central claim is that a CNN model using MFCC features from 40,000 Hindi audio samples achieves 91.79 percent accuracy in keyword spotting. This holds across tested architectures and supports efficient, on-device operation customized for specific users and queries in continuous Hindi speech.

What carries the argument

Convolutional Neural Networks processing Mel Frequency Cepstral Coefficients extracted from raw Hindi audio recordings to classify keywords.
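The paper's feature pipeline computes 13 MFCCs per frame from a log-mel spectrogram via a discrete cosine transform, using the mel scale m = 2595 log10(1 + f/700). A minimal NumPy sketch of that pipeline follows; the window, hop, and filter-bank sizes are illustrative defaults, not the authors' exact configuration:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mfcc(signal, sr=44100, n_fft=2048, hop=512, n_mels=40, n_mfcc=13):
    """Minimal MFCC: frame -> power spectrum -> mel filterbank -> log -> DCT-II."""
    # Frame the signal with a Hann window.
    n_frames = 1 + max(0, (len(signal) - n_fft) // hop)
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank spaced evenly on the mel scale, 0 Hz to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fbank[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # DCT-II over the mel axis; keep the first n_mfcc coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T  # shape: (n_frames, n_mfcc)
```

In practice a library such as librosa would do this in one call; the point here is only that each 1.9-second clip reduces to a small (frames × 13) matrix that a CNN can treat like an image.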

If this is right

  • The system enables computationally efficient keyword detection suitable for on-device Hindi speech applications.
  • User-specific customization becomes feasible without retraining from scratch.
  • Different CNN architectures can be compared for balancing accuracy and speed in speech tasks.
  • Keyword spotting works within streams of continuous speech rather than isolated words.
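One way to make the accuracy-versus-speed comparison across CNN designs concrete is through parameter counts, which dominate on-device memory and compute. A minimal sketch with hypothetical layer sizes, not taken from the paper:

```python
def conv2d_params(c_in, c_out, k):
    # Standard 2D convolution: one k x k kernel per (input, output)
    # channel pair, plus one bias per output channel.
    return c_in * c_out * k * k + c_out

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise k x k filter per input channel, then a 1x1 pointwise
    # convolution to mix channels; biases for both stages.
    return (c_in * k * k + c_in) + (c_in * c_out + c_out)

# Hypothetical layer: 64 input channels, 128 output channels, 3x3 kernels.
standard = conv2d_params(64, 128, 3)              # 73,856 parameters
separable = depthwise_separable_params(64, 128, 3)  # 8,960 parameters
```

Swapping standard convolutions for depthwise-separable ones cuts this layer's parameters by roughly 8x, which is the kind of tradeoff an architecture comparison for on-device keyword spotting would weigh against accuracy.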

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach may extend to other languages sharing phonetic traits with Hindi.
  • Integration with mobile voice assistants could improve accessibility for Hindi speakers.
  • Performance might improve further with larger or more varied training data from multiple regions.

Load-bearing premise

The collected 40,000 audio samples at 44 kHz are representative enough of general Hindi speech to train a model that works reliably on new inputs without overfitting to the training keywords or speakers.

What would settle it

Running the model on a fresh collection of Hindi speech recordings from unseen speakers and checking whether accuracy stays near 91.79 percent or drops significantly.
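A speaker-disjoint split is the mechanical core of such a test: no voice heard in training may appear in evaluation. A minimal sketch, assuming clips are tagged with speaker IDs (the record format here is hypothetical, not the paper's):

```python
import random
from collections import defaultdict

def speaker_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (clip_id, speaker_id, label) records so no speaker crosses the split.

    Held-out-speaker accuracy is what would distinguish genuine
    generalization from memorization of the training voices.
    """
    by_speaker = defaultdict(list)
    for clip in samples:
        by_speaker[clip[1]].append(clip)
    speakers = sorted(by_speaker)
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = speakers[:n_test]
    train = [c for s in speakers[n_test:] for c in by_speaker[s]]
    test = [c for s in test_speakers for c in by_speaker[s]]
    return train, test
```

scikit-learn's GroupShuffleSplit implements the same idea; the decisive question is simply whether the reported 91.79 percent survives when the groups are speakers the model has never heard.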

Figures

Figures reproduced from arXiv: 2605.02928 by Pushparaj Mani Pathak, Saru Bharti.

Figure 1. System architecture [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Figure 2. Neural network representation
Figure 3. Raw audio samples waveform [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]
Figure 4. Raw audio spectrogram [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]
Figure 5. Audio samples waveform after augmentation [PITH_FULL_IMAGE:figures/full_fig_p004_5.png]
Figure 6. Audio spectrogram after augmentation
Figure 7. Confusion matrix for model inference [PITH_FULL_IMAGE:figures/full_fig_p004_7.png]
Figure 8. Training and validation accuracy [PITH_FULL_IMAGE:figures/full_fig_p005_8.png]
Figure 9. Training and validation loss
read the original abstract

In this study, we investigate the application of keyword spotting (KWS) in the domain of Hindi speech recognition, utilizing a dataset comprising 40,000 audio samples. With a sampling rate of 44 kHz and an average duration of 1.9 seconds per sample, we focus on developing an efficient on-device KWS system tailored for user-specific queries. Leveraging Convolutional Neural Networks (CNNs) for classification, we employ feature engineering techniques to convert raw audio recordings into Mel Frequency Cepstral Coefficients (MFCCs) as an input for our network. Our experiments encompass various CNN architectures, exploring their efficacy in identifying predefined keywords within the continuous speech stream. Our CNN-based approach achieves a commendable accuracy rate of 91.79% through rigorous evaluation, demonstrating promising performance while ensuring computational efficiency and user-specific customization in Hindi speech recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates the use of convolutional neural networks (CNNs) for keyword spotting in Hindi speech recognition. It utilizes a dataset of 40,000 audio samples at 44 kHz sampling rate with an average duration of 1.9 seconds, converts them to MFCC features, and reports an accuracy of 91.79% for identifying predefined keywords, emphasizing on-device efficiency and user-specific customization.

Significance. If the accuracy claim is validated with appropriate experimental controls, this could be a meaningful contribution to speech recognition for Hindi, offering a computationally efficient approach suitable for on-device applications in a low-resource language setting. The work highlights the potential of CNNs in KWS but requires rigorous validation to establish its significance.

major comments (2)
  1. [Abstract] The reported accuracy of 91.79% is presented without any information on the train/test split ratios, the number of distinct keywords in the dataset, speaker-disjoint evaluation protocols, cross-validation methods, or comparisons to baselines. This omission makes it impossible to determine if the result demonstrates generalizable keyword spotting or merely memorization of the training samples.
  2. [Dataset and Methods] The assumption that the 40,000 samples are sufficient and representative for training a robust KWS system is not supported by details on keyword coverage, speaker diversity, or measures taken to avoid overfitting. The average duration of 1.9 seconds and 44 kHz sampling are given, but without these specifics, the central performance claim cannot be assessed.
minor comments (1)
  1. [Abstract] The term 'rigorous evaluation' is used but not substantiated with specific metrics or procedures, which could be clarified for better readability.
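One concrete way the authors could substantiate "rigorous evaluation" is to report per-keyword recall alongside the single 91.79% figure, since an aggregate accuracy can mask classes that fail badly; the confusion matrix (Figure 7) already contains the needed counts. A minimal sketch with hypothetical counts:

```python
import numpy as np

def accuracy_and_recall(cm):
    """Overall accuracy and per-class recall from a confusion matrix.

    cm[i, j] counts clips of true keyword i predicted as keyword j.
    """
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / np.maximum(cm.sum(axis=1), 1.0)
    return accuracy, recall

# Hypothetical 3-keyword matrix: decent overall accuracy, one weak class.
cm = [[95,  3,  2],
      [ 4, 90,  6],
      [10, 20, 70]]
```

Here overall accuracy is 0.85 while the third keyword's recall is only 0.70; reporting both is what turns a headline number into an assessable claim.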

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the changes we will make in the revised version to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] The reported accuracy of 91.79% is presented without any information on the train/test split ratios, the number of distinct keywords in the dataset, speaker-disjoint evaluation protocols, cross-validation methods, or comparisons to baselines. This omission makes it impossible to determine if the result demonstrates generalizable keyword spotting or merely memorization of the training samples.

    Authors: We agree that the abstract omits key experimental details necessary for evaluating the generalizability of the reported accuracy. In the revised manuscript, we will expand the abstract to explicitly state the train/test split ratios, the number of distinct keywords, the use of speaker-disjoint evaluation protocols, the cross-validation methods employed, and comparisons against baselines. These additions will clarify that the evaluation was designed to assess generalization rather than memorization. revision: yes

  2. Referee: [Dataset and Methods] The assumption that the 40,000 samples are sufficient and representative for training a robust KWS system is not supported by details on keyword coverage, speaker diversity, or measures taken to avoid overfitting. The average duration of 1.9 seconds and 44 kHz sampling are given, but without these specifics, the central performance claim cannot be assessed.

    Authors: We acknowledge that the Dataset and Methods section requires additional information to substantiate the representativeness of the 40,000-sample dataset and the robustness of the results. We will revise this section to include details on keyword coverage across the samples, speaker diversity in the dataset, and specific measures taken to avoid overfitting, such as regularization techniques and validation strategies. This will enable a more thorough assessment of the performance claim. revision: yes

Circularity Check

0 steps flagged

No circularity: standard empirical ML pipeline with no derivations or self-referential predictions

full rationale

The paper presents an empirical keyword-spotting experiment: 40k audio samples are converted to MFCC features and fed to CNN classifiers, with a reported test accuracy of 91.79%. No equations, first-principles derivations, or predictions appear in the abstract or described manuscript. The accuracy figure is obtained by training and evaluating a model on the supplied dataset; it is not obtained by fitting a parameter and then renaming the fit as a prediction, nor by any self-citation chain that substitutes for an independent result. Because there is no derivation chain to inspect, none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, etc.) can be exhibited by direct quotation. The work is therefore self-contained as a conventional supervised-learning report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard assumptions that MFCC features adequately capture Hindi speech and that the collected dataset is unbiased and representative; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5438 in / 1047 out tokens · 51055 ms · 2026-05-09T20:02:14.120766+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 7 canonical work pages

  1. [1] Tarun Pruthi, Sameer Saksena and Pradip K. Das, "Swaranjali: Isolated word recognition for Hindi language using VQ and HMM." International Conference on Multimedia Processing and Systems (ICMPS), IIT Madras, 2000.

  2. [2] Kuldeep Kumar, R. K. Aggarwal, and Ankita Jain, "A Hindi speech recognition system for connected words using HTK." International Journal of Computational Systems Engineering, vol. 1, no. 1, 2012, pp. 25-32.

  3. [3] A. N. Mishra, M. Chandra, A. Biswas and S. N. Sharan, "Robust features for connected Hindi digits recognition." International Journal of Signal Processing, Image Processing and Pattern Recognition 4.2, 2011, pp. 79-90.

  4. [4] Suman K. Saksamudre and R. R. Deshmukh, "Comparative Study of Isolated Word Recognition System for Hindi Language." International Journal of Engineering Research and Technology, vol. 4, no. 07, July 2015.

  5. [5] ESRSA Publications, 2015.

  6. [6] Neema Mishra, Urmila Shrawankar, and Vilas M. Thakare, "An Overview of Hindi Speech Recognition." arXiv preprint arXiv:1305.2847, 2013.

  7. [7] Abhishek Thakur and Rajesh Kumar, "Automatic Speech Recognition System for Hindi Utterances with Regional Indian Accents: A Review." International Journal of Electronics and Communication Technology, 2013.

  8. [8] Preeti Saini, Parneet Kaur, and Mohit Dua, "Hindi Automatic Speech Recognition Using HTK." International Journal of Engineering Trends and Technology 4, 2013.

  9. [9] Mohit Kumar, Nitendra Rajput, and Ashish Verma, "A large-vocabulary continuous speech recognition system for Hindi." IBM Journal of Research and Development 48.5/6, 2004, pp. 703-715.

  10. [10] S. Sinha, S. S. Agrawal and A. Jain, "Continuous density Hidden Markov Model for context-dependent Hindi speech recognition." Int. Conference on Advances in Computing, Communication and Informatics (ICACCI), IEEE, 2013, pp. 1953-1958.

  11. [11] Lawrence R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE 77.2, 1989, pp. 257-286.

  12. [12] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus." arXiv preprint arXiv:1912.06670, 2019.

  13. [13] Pete Warden, "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition." arXiv preprint arXiv:1804.03209, 2018.

  14. [15] https://www.openslr.org/118/

  15. [16] Vansh Bansal, Thishyan Raj T, Nagarathna Ravi, Shubham Korde, Jaskaran Kalra, Sudha Murugesan, Ramkrishnan B, Aboli Gore and Vipul Arora, "Parturition Hindi Speech Dataset for Automatic Speech Recognition." 2023 National Conference on Communications (NCC), Guwahati, India, 2023, pp. 1-6, doi: 10.1109/NCC56989.2023.10067891.

  16. [17] http://www.speech.cs.cmu.edu/cgi-bin/cmudict

  17. [18] https://en.wikipedia.org/wiki/ARPABET

  18. [19] N. Aafaq, M. Saleem, J. T. Khan and I. H. Abbasi, "Convolutional Neural Networks for Deep Spoken Keyword Spotting." 2023 3rd International Conference on Artificial Intelligence (ICAI), Islamabad, Pakistan, 2023, pp. 170-175, doi: 10.1109/ICAI58407.2023.10136648.

  19. [20] Iván López-Espejo, Zheng-Hua Tan, John Hansen and Jesper Jensen, "Deep Spoken Keyword Spotting: An Overview." arXiv preprint arXiv:2111.10592, 2021.

  20. [21] A. Diwan, R. Vaideeswaran, S. Shah, A. Singh, S. Raghavan, S. Khare, V. Unni, S. Vyas, A. Rajpuria, C. Yarra, and A. Mittal, "Multilingual and code-switching ASR challenges for low resource Indian languages." arXiv preprint arXiv:2104.00235, 2021.