pith. machine review for the scientific record.

arxiv: 2603.18239 · v1 · submitted 2026-03-18 · 🧬 q-bio.QM · cs.CL · cs.LG

Recognition: no theorem link

Impact of automatic speech recognition quality on Alzheimer's disease detection from spontaneous speech: a reproducible benchmark study with lexical modeling and statistical validation

Authors on Pith no claims yet

Pith reviewed 2026-05-15 08:21 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.CL · cs.LG

keywords Alzheimer's disease detection · automatic speech recognition · Whisper ASR · lexical features · TF-IDF · spontaneous speech · machine learning classification

The pith

Higher-quality Whisper transcripts improve Alzheimer's detection accuracy using simple lexical models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether automatic speech recognition quality changes how well lexical features can detect Alzheimer's disease from spontaneous speech. It processes the ADReSSo 2021 dataset with two Whisper model sizes, extracts TF-IDF representations, and trains Logistic Regression and Linear SVM classifiers under repeated stratified cross-validation. Models using the higher-quality Whisper-small transcripts reach balanced accuracy above 0.785 with Linear SVM and show statistically significant gains over Whisper-base transcripts. Classifier choice affects results less than transcript quality. Language feature analysis finds that healthy speakers use more precise object and scene descriptions while Alzheimer's speech contains more vagueness, discourse markers, and hesitation patterns.

Core claim

Transcript quality from automatic speech recognition exerts a statistically significant effect on classification performance. Whisper-small transcripts produce higher balanced accuracy than Whisper-base transcripts when the same lexical TF-IDF pipeline and interpretable classifiers are applied to the ADReSSo 2021 diagnosis set. Linear SVM reaches above 0.785 balanced accuracy on the better transcripts. Paired statistical tests confirm the improvement. Feature inspection shows cognitively normal speakers favor precise descriptive language whereas Alzheimer's speakers exhibit vagueness, discourse markers, and hesitation.

What carries the argument

Lexical TF-IDF features extracted from ASR transcripts, fed to Linear SVM and Logistic Regression under repeated 5x5 stratified cross-validation with paired statistical testing.
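As a minimal sketch of that pipeline in scikit-learn: the vectorizer settings, SVM regularization, and toy transcripts below are illustrative assumptions, not values or data reported by the paper.

```python
# Sketch of the review's described pipeline: TF-IDF lexical features from
# ASR transcripts, a linear classifier, repeated 5x5 stratified CV scored
# by balanced accuracy. Transcripts and hyperparameters are placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-ins for ADReSSo transcripts (0 = cognitively normal, 1 = AD).
texts = (["the boy is reaching for the cookie jar while the sink overflows"] * 15
         + ["um well there is a thing um you know over there"] * 15)
labels = [0] * 15 + [1] * 15

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # assumed settings
    LinearSVC(C=1.0),
)

# Repeated 5x5 stratified CV -> 25 per-fold balanced-accuracy scores,
# the unit of analysis for the paper's paired statistical tests.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = cross_val_score(clf, texts, labels, cv=cv, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.4f} +/- {scores.std():.4f} "
      f"over {len(scores)} folds")
```

Running the same loop once per ASR condition (Whisper-base vs. Whisper-small transcripts) with identical fold seeds yields the paired per-fold scores the significance tests compare.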

If this is right

  • High-quality ASR allows simple, interpretable lexical models to reach competitive Alzheimer's detection without acoustic modeling.
  • ASR model choice is a more important modeling decision than classifier complexity in clinical speech systems.
  • Cognitively normal speech contains more precise object- and scene-descriptive terms than Alzheimer's speech.
  • A reproducible benchmark pipeline is supplied for future ASR-quality studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Speech-based diagnostic pipelines may benefit more from upgrading the ASR front-end than from adding model complexity.
  • The same lexical approach could be tested on other neurodegenerative conditions that alter spontaneous speech.
  • Adding acoustic features might either amplify or reduce the observed ASR-quality effect.

Load-bearing premise

The ADReSSo 2021 speech samples and diagnosis labels represent typical real-world clinical speech, and lexical features alone are sufficient to capture disease-relevant differences.

What would settle it

Re-running the identical lexical pipeline on an independent spontaneous-speech dataset of Alzheimer's patients and controls and testing whether the performance gap between Whisper-small and Whisper-base transcripts stays statistically significant.

Figures

Figures reproduced from arXiv: 2603.18239 by Himadri S Samanta.

Figure 1
Figure 1. Speech-based Alzheimer's disease detection pipeline used in this work. [PITH_FULL_IMAGE:figures/full_fig_p009_1.png]
Figure 2
Figure 2. Repeated 5×5 cross-validation balanced accuracy for each classifier/ASR combination. Error bars denote standard deviation across 25 folds. [PITH_FULL_IMAGE:figures/full_fig_p010_2.png]
Figure 3
Figure 3. Effect sizes (Cohen's d) for key paired comparisons. The ASR effect is larger than the classifier effect. [PITH_FULL_IMAGE:figures/full_fig_p010_3.png]
Figure 4
Figure 4. Out-of-fold ROC curve for logistic regression trained on Whisper-small transcripts. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]
Figure 5
Figure 5. Aggregate confusion matrix (raw counts) for logistic regression with Whisper-small transcripts. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png]
Figure 6
Figure 6. Normalized confusion matrix for logistic regression with Whisper-small transcripts. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png]
Figure 7
Figure 7. Top AD-indicative lexical terms from the logistic-regression model trained on Whisper-small transcripts. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png]
Figure 8
Figure 8. Top CN-indicative lexical terms from the logistic-regression model trained on Whisper-small transcripts. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png]
Original abstract

Early detection of Alzheimer's disease from spontaneous speech has emerged as a promising non-invasive screening approach. However, the influence of automatic speech recognition (ASR) quality on downstream clinical language modeling remains insufficiently understood. In this study, we investigate Alzheimer's disease detection using lexical features derived from Whisper ASR transcripts on the ADReSSo 2021 diagnosis dataset. We evaluate interpretable machine-learning models, including Logistic Regression and Linear Support Vector Machines, using TF-IDF text representations under repeated 5x5 stratified cross-validation. Our results demonstrate that transcript quality has a statistically significant impact on classification performance. Models trained on Whisper-small transcripts consistently outperform those using Whisper-base transcripts, achieving balanced accuracy above 0.7850 with Linear SVM. Paired statistical testing confirms that the observed improvements are significant. Importantly, classifier complexity contributes less to performance variation than ASR transcription quality. Feature analysis reveals that cognitively normal speakers produce more semantically precise object- and scene-descriptive language, whereas Alzheimer's speech is characterized by vagueness, discourse markers, and increased hesitation patterns. These findings suggest that high-quality ASR can enable simple, interpretable lexical models to achieve competitive Alzheimer's detection performance without explicit acoustic modeling. The study provides a reproducible benchmark pipeline and highlights ASR selection as a critical modeling decision in clinical speech-based artificial intelligence systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a benchmark study on Alzheimer's disease detection from spontaneous speech on the ADReSSo 2021 dataset. It derives TF-IDF lexical features from Whisper-base and Whisper-small ASR transcripts, trains Logistic Regression and Linear SVM classifiers under repeated 5x5 stratified cross-validation, and claims that higher-quality (small) transcripts yield statistically significant gains in balanced accuracy (above 0.7850 for Linear SVM), that ASR quality matters more than classifier choice, and that cognitively normal speech shows more precise descriptive language while AD speech exhibits vagueness and hesitation markers. A reproducible pipeline is provided.

Significance. If the statistical claims hold under proper dependence-aware testing, the work supplies a useful public-data benchmark showing that ASR selection can be more load-bearing than model complexity for interpretable lexical AD detection, with direct implications for practical non-invasive screening systems that avoid acoustic feature engineering.

major comments (2)
  1. [Results / Statistical Validation] The paired statistical testing (abstract and results section) that establishes significant gains for Whisper-small over base transcripts does not specify the exact test (paired t-test vs. Wilcoxon), report p-values or effect sizes, or address dependence among the 25 CV folds arising from overlapping training sets. Standard paired tests assume fold independence; without a permutation test or blocked resampling that respects the CV structure, the headline claim that ASR quality drives performance more than classifier choice cannot be fully evaluated.
  2. [Results] Table or figure reporting balanced accuracy >0.7850 for Linear SVM on Whisper-small transcripts provides no per-fold standard deviations, confidence intervals, or variance across the repeated CV runs, weakening assessment of result stability and the cross-ASR comparison.
minor comments (2)
  1. [Abstract] The abstract and methods should explicitly name the statistical test, threshold, and software used for the paired comparison to allow immediate reproducibility.
  2. [Discussion] Feature analysis claims about vagueness and hesitation patterns in AD speech would benefit from quantitative counts or example n-grams rather than qualitative description alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major point below and will revise the paper to strengthen the statistical reporting and transparency of our results.

Point-by-point responses
  1. Referee: [Results / Statistical Validation] The paired statistical testing (abstract and results section) that establishes significant gains for Whisper-small over base transcripts does not specify the exact test (paired t-test vs. Wilcoxon), report p-values or effect sizes, or address dependence among the 25 CV folds arising from overlapping training sets. Standard paired tests assume fold independence; without a permutation test or blocked resampling that respects the CV structure, the headline claim that ASR quality drives performance more than classifier choice cannot be fully evaluated.

    Authors: We agree that the description of the statistical testing was insufficiently detailed. In the revised manuscript we will explicitly state that we applied a paired Wilcoxon signed-rank test to the per-fold balanced accuracy scores, report the exact p-values together with effect sizes (rank-biserial correlation), and add a permutation test that respects the repeated CV structure by randomly shuffling labels within each training partition while preserving the fold structure. These additions will allow readers to fully evaluate the claim that ASR quality contributes more to performance variation than classifier choice. revision: yes

  2. Referee: [Results] Table or figure reporting balanced accuracy >0.7850 for Linear SVM on Whisper-small transcripts provides no per-fold standard deviations, confidence intervals, or variance across the repeated CV runs, weakening assessment of result stability and the cross-ASR comparison.

    Authors: We acknowledge the omission. The revised results section and associated table will report the mean balanced accuracy together with the standard deviation across the 25 CV folds and 95% confidence intervals obtained via bootstrap resampling of the fold-level scores. This will enable direct assessment of result stability and the reliability of the cross-ASR performance differences. revision: yes
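The statistical additions promised in the responses above could be sketched as follows; the per-fold score arrays are synthetic placeholders standing in for the paper's 25 balanced-accuracy values, and the effect-size formula assumes SciPy's default two-sided statistic (the smaller signed-rank sum), so it yields the magnitude of the rank-biserial correlation.

```python
# Sketch of the rebuttal's statistics: paired Wilcoxon signed-rank test on
# per-fold balanced accuracies, rank-biserial effect size, and a bootstrap
# 95% CI on the mean per-fold difference. Scores below are synthetic.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
acc_small = rng.normal(0.785, 0.03, size=25)  # Whisper-small, 25 folds (synthetic)
acc_base = rng.normal(0.740, 0.03, size=25)   # Whisper-base, 25 folds (synthetic)

# Paired test on matched folds (same splits, different ASR front-end).
stat, p_value = wilcoxon(acc_small, acc_base)

# Magnitude of the rank-biserial correlation, assuming `stat` is the
# smaller of the two signed-rank sums (SciPy's two-sided default).
n = len(acc_small)
rank_biserial = 1.0 - 2.0 * stat / (n * (n + 1) / 2)

# Bootstrap 95% CI for the mean per-fold difference.
diffs = acc_small - acc_base
boot = np.array([rng.choice(diffs, size=n, replace=True).mean()
                 for _ in range(10_000)])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"p={p_value:.4g}, |rank-biserial|={rank_biserial:.2f}, "
      f"95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

A permutation test that respects the repeated-CV structure, as the authors propose, would additionally reshuffle labels within training partitions rather than resampling fold scores.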

Circularity Check

0 steps flagged

No circularity: empirical benchmark on external public dataset

Full rationale

The paper reports direct empirical comparisons of standard ML pipelines (TF-IDF lexical features, Linear SVM / Logistic Regression) on the public ADReSSo 2021 dataset under repeated 5x5 stratified CV. Performance differences between Whisper-small and Whisper-base transcripts are measured via paired statistical tests on held-out folds. No equations derive a target quantity from a fitted parameter that is then re-used as a prediction, no self-citation supplies a uniqueness theorem or ansatz that bears the central claim, and no known result is merely renamed. The derivation chain consists entirely of reproducible data processing and off-the-shelf classifiers evaluated on external labels, so the reported balanced-accuracy gains and significance statements are independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities are introduced. The work rests on standard machine-learning assumptions about data labels and feature sufficiency.

axioms (2)
  • domain assumption ADReSSo 2021 dataset labels accurately reflect clinical Alzheimer's status
    Required for supervised training and evaluation of classifiers.
  • domain assumption Lexical TF-IDF features are sufficient to capture disease-related language differences
    Central modeling choice that excludes acoustic features.

pith-pipeline@v0.9.0 · 5543 in / 1228 out tokens · 50346 ms · 2026-05-15T08:21:59.320893+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

  1. [1]

    K. C. Fraser, J. A. Meltzer, F. Rudzicz, Linguistic features identify Alzheimer's disease in narrative speech, Journal of Alzheimer's Disease 49 (2) (2016) 407–422. doi:10.3233/JAD-150520

  2. [2]

    S. de la Fuente Garcia, S. Luz, Evaluating the effect of linguistic and acoustic features on Alzheimer's dementia recognition, Frontiers in Aging Neuroscience 10 (2018) 207. doi:10.3389/fnagi.2018.00207

  3. [3]

    F. Haider, S. de la Fuente, S. Luz, An investigation of acoustic and linguistic features for Alzheimer's dementia detection, Frontiers in Computer Science 2 (2020) 624659. doi:10.3389/fcomp.2020.624659

  4. [4]

    S. Luz, F. Haider, S. de la Fuente Garcia, Editorial: Alzheimer's dementia recognition through spontaneous speech, Frontiers in Computer Science 3 (2021) 780169. doi:10.3389/fcomp.2021.780169

  5. [5]

    Z. S. Syed, M. Rashid, R. Naqvi, S. Ehsan, X. Wang, M. ur Rehman, R. Naguib, K. McDonald-Maier, Tackling the ADReSSo challenge 2021: The MUET-RMIT system for Alzheimer's dementia recognition from spontaneous speech, in: Proc. Interspeech 2021, 2021, pp. 3810–3814. doi:10.21437/Interspeech.2021-1761

  6. [6]

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: Proceedings of the 40th International Conference on Machine Learning, 2023, pp. 28492–28518

  7. [7]

    S. O. Orimaye, J. S.-M. Wong, K. J. Golden, Predicting probable Alzheimer's disease using linguistic deficits and biomarkers, BMC Bioinformatics 15 (2014) 34. doi:10.1186/1471-2105-15-34

  8. [8]

    W. Jarrold, B. Peintner, E. Yeh, R. Krasnow, H. Javitz, G. E. Swan, Disfluencies and discourse perseveration in Alzheimer's disease, International Journal of Language & Communication Disorders 49 (6) (2014) 617–628. doi:10.1111/1460-6984.12095

  9. [9]

    A. Satt, S. Rozenberg, R. Hoory, Efficient emotion recognition from speech using deep learning on spectrograms, in: Proc. Interspeech 2017, 2017, pp. 1089–1093

  10. [10]

    S. V. Pakhomov, et al., Computerized analysis of speech and language to identify psycholinguistic correlates of frontotemporal lobar degeneration, Cognitive and Behavioral Neurology 23 (3) (2010) 165–177

  11. [11]

    V. Taler, N. A. Phillips, Language and Alzheimer's disease, Current Alzheimer Research 5 (4) (2008) 352–366

  12. [12]

    S. Luz, F. Haider, S. de la Fuente, D. Fromm, B. MacWhinney, Alzheimer's dementia recognition through spontaneous speech: The ADReSS challenge, in: Proc. Interspeech 2020, 2020, pp. 2172–2176. doi:10.21437/Interspeech.2020-2571

  13. [13]

    B. MacWhinney, The TalkBank system for research on spoken communication, in: The Oxford Handbook of Psycholinguistics, Oxford University Press, 2011

  14. [14]

    A. M. Lanzi, A. K. Saylor, D. Fromm, H. Liu, B. MacWhinney, M. L. Cohen, DementiaBank: Theoretical rationale, protocol, and illustrative analyses, American Journal of Speech-Language Pathology 32 (2) (2023) 426–438. doi:10.1044/2022_AJSLP-22-00281

  15. [15]

    S. de la Fuente Garcia, S. Luz, Evaluation of computational features for automatic prediction of mild cognitive impairment from speech, Frontiers in Aging Neuroscience 12 (2020) 593215. doi:10.3389/fnagi.2020.593215

  16. [16]

    J. Chen, Z. Ke, Q. Zhu, Y. Wang, et al., Automatic detection of Alzheimer's disease using spontaneous speech only, Frontiers in Aging Neuroscience 14 (2022) 843456

  17. [17]

    X. Qi, H. Zhang, et al., Noninvasive automatic detection of Alzheimer's disease from spontaneous speech with prompt-based learning, Frontiers in Aging Neuroscience 15 (2023) 1172960

  18. [18]

    A. Balagopalan, et al., Comparison of speech technologies for Alzheimer's disease detection, Computer Speech & Language (2020)

  19. [19]

    J. Ramos, Using TF-IDF to determine word relevance in document queries, Proceedings of the First Instructional Conference on Machine Learning (2003)

  20. [20]

    C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297. doi:10.1007/BF00994018

  21. [21]

    R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871–1874

  22. [22]

    T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd Edition, Springer, 2009

  23. [23]

    J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (1) (2010) 1–22

  24. [24]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008

  25. [25]

    J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, NAACL-HLT (2019) 4171–4186

  26. [26]

    I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016

  27. [27]

    A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460

  28. [28]

    W. Kong, et al., Self-supervised speech representation learning for neurodegenerative disorder detection: A review, IEEE Journal of Biomedical and Health Informatics (2022)

  29. [29]

    F. Eyben, M. Woellmer, B. Schuller, The Munich openSMILE toolkit, Proceedings of the ACM Multimedia (2010)

  30. [30]

    B. Roark, M. Mitchell, K. Hollingshead, et al., Spoken language derived measures for detecting mild cognitive impairment, IEEE Transactions on Audio, Speech, and Language Processing 19 (7) (2011) 2081–2090

  31. [31]

    Y. Wu, et al., A review of automatic speech and language processing for Alzheimer's disease and mild cognitive impairment, IEEE Reviews in Biomedical Engineering (2020)

  32. [32]

    A. B. R. Shatte, D. M. Hutchinson, S. J. Teague, Machine learning in mental health and related disorders: A systematic review, Journal of Medical Internet Research 21 (5) (2019) e15768

  33. [33]

    A. L. Beam, I. S. Kohane, Big data and machine learning in health care, JAMA 319 (13) (2018) 1317–1318

  34. [34]

    A. Rajkomar, J. Dean, I. Kohane, Machine learning in medicine, New England Journal of Medicine 380 (14) (2019) 1347–1358

  35. [35]

    E. J. Topol, High-performance medicine: the convergence of human and artificial intelligence, Nature Medicine 25 (1) (2019) 44–56

  36. [36]

    Y. Zhou, et al., Interpretable machine learning for healthcare, Nature Biomedical Engineering (2021)

  37. [37]

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830