Impact of automatic speech recognition quality on Alzheimer's disease detection from spontaneous speech: a reproducible benchmark study with lexical modeling and statistical validation
Pith reviewed 2026-05-15 08:21 UTC · model grok-4.3
The pith
Higher-quality Whisper transcripts improve Alzheimer's detection accuracy using simple lexical models
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transcript quality from automatic speech recognition exerts a statistically significant effect on classification performance. Whisper-small transcripts produce higher balanced accuracy than Whisper-base transcripts when the same lexical TF-IDF pipeline and interpretable classifiers are applied to the ADReSSo 2021 diagnosis set. Linear SVM reaches above 0.785 balanced accuracy on the better transcripts. Paired statistical tests confirm the improvement. Feature inspection shows cognitively normal speakers favor precise descriptive language whereas Alzheimer's speakers exhibit vagueness, discourse markers, and hesitation.
What carries the argument
Lexical TF-IDF features extracted from ASR transcripts, fed to Linear SVM and Logistic Regression under repeated 5x5 stratified cross-validation with paired statistical testing.
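The pipeline described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the authors' exact configuration: the toy transcripts and labels are invented stand-ins for the ADReSSo 2021 Whisper transcripts, and settings such as the n-gram range are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in transcripts (the paper uses ADReSSo 2021 Whisper transcripts).
texts = [
    "the boy is reaching for the cookie jar on the top shelf",
    "um well there is a thing and um something over there",
] * 10
labels = [0, 1] * 10  # 0 = cognitively normal, 1 = AD (illustrative)

# TF-IDF features into a Linear SVM, as in the paper; the unigram+bigram
# range is an assumption for this sketch.
pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())

# Repeated 5x5 stratified cross-validation scored by balanced accuracy.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv,
                         scoring="balanced_accuracy")
print(len(scores), round(scores.mean(), 4))  # 25 fold-level scores
```

Running the same loop twice, once per set of Whisper transcripts, yields the paired per-fold scores that the statistical comparison operates on.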
If this is right
- High-quality ASR allows simple, interpretable lexical models to reach competitive Alzheimer's detection without acoustic modeling.
- ASR model choice is a more important modeling decision than classifier complexity in clinical speech systems.
- Cognitively normal speech contains more precise object- and scene-descriptive terms than Alzheimer's speech.
- A reproducible benchmark pipeline is supplied for future ASR-quality studies.
Where Pith is reading between the lines
- Speech-based diagnostic pipelines may benefit more from upgrading the ASR front-end than from adding model complexity.
- The same lexical approach could be tested on other neurodegenerative conditions that alter spontaneous speech.
- Adding acoustic features might either amplify or reduce the observed ASR-quality effect.
Load-bearing premise
The ADReSSo 2021 speech samples and diagnosis labels represent typical real-world clinical speech, and lexical features alone are sufficient to capture disease-relevant differences.
What would settle it
Re-running the identical lexical pipeline on an independent spontaneous-speech dataset of Alzheimer's patients and controls and testing whether the performance gap between Whisper-small and Whisper-base transcripts stays statistically significant.
Original abstract
Early detection of Alzheimer's disease from spontaneous speech has emerged as a promising non-invasive screening approach. However, the influence of automatic speech recognition (ASR) quality on downstream clinical language modeling remains insufficiently understood. In this study, we investigate Alzheimer's disease detection using lexical features derived from Whisper ASR transcripts on the ADReSSo 2021 diagnosis dataset. We evaluate interpretable machine-learning models, including Logistic Regression and Linear Support Vector Machines, using TF-IDF text representations under repeated 5x5 stratified cross-validation. Our results demonstrate that transcript quality has a statistically significant impact on classification performance. Models trained on Whisper-small transcripts consistently outperform those using Whisper-base transcripts, achieving balanced accuracy above 0.7850 with Linear SVM. Paired statistical testing confirms that the observed improvements are significant. Importantly, classifier complexity contributes less to performance variation than ASR transcription quality. Feature analysis reveals that cognitively normal speakers produce more semantically precise object- and scene-descriptive language, whereas Alzheimer's speech is characterized by vagueness, discourse markers, and increased hesitation patterns. These findings suggest that high-quality ASR can enable simple, interpretable lexical models to achieve competitive Alzheimer's detection performance without explicit acoustic modeling. The study provides a reproducible benchmark pipeline and highlights ASR selection as a critical modeling decision in clinical speech-based artificial intelligence systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a benchmark study on Alzheimer's disease detection from spontaneous speech on the ADReSSo 2021 dataset. It derives TF-IDF lexical features from Whisper-base and Whisper-small ASR transcripts, trains Logistic Regression and Linear SVM classifiers under repeated 5x5 stratified cross-validation, and claims that higher-quality (small) transcripts yield statistically significant gains in balanced accuracy (above 0.7850 for Linear SVM), that ASR quality matters more than classifier choice, and that cognitively normal speech shows more precise descriptive language while AD speech exhibits vagueness and hesitation markers. A reproducible pipeline is provided.
Significance. If the statistical claims hold under proper dependence-aware testing, the work supplies a useful public-data benchmark showing that ASR selection can be more load-bearing than model complexity for interpretable lexical AD detection, with direct implications for practical non-invasive screening systems that avoid acoustic feature engineering.
major comments (2)
- [Results / Statistical Validation] The paired statistical testing (abstract and results section) that establishes significant gains for Whisper-small over base transcripts does not specify the exact test (paired t-test vs. Wilcoxon), report p-values or effect sizes, or address dependence among the 25 CV folds arising from overlapping training sets. Standard paired tests assume fold independence; without a permutation test or blocked resampling that respects the CV structure, the headline claim that ASR quality drives performance more than classifier choice cannot be fully evaluated.
- [Results] Table or figure reporting balanced accuracy >0.7850 for Linear SVM on Whisper-small transcripts provides no per-fold standard deviations, confidence intervals, or variance across the repeated CV runs, weakening assessment of result stability and the cross-ASR comparison.
minor comments (2)
- [Abstract] The abstract and methods should explicitly name the statistical test, threshold, and software used for the paired comparison to allow immediate reproducibility.
- [Discussion] Feature analysis claims about vagueness and hesitation patterns in AD speech would benefit from quantitative counts or example n-grams rather than qualitative description alone.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major point below and will revise the paper to strengthen the statistical reporting and transparency of our results.
Point-by-point responses
- Referee: [Results / Statistical Validation] The paired statistical testing (abstract and results section) that establishes significant gains for Whisper-small over base transcripts does not specify the exact test (paired t-test vs. Wilcoxon), report p-values or effect sizes, or address dependence among the 25 CV folds arising from overlapping training sets. Standard paired tests assume fold independence; without a permutation test or blocked resampling that respects the CV structure, the headline claim that ASR quality drives performance more than classifier choice cannot be fully evaluated.
Authors: We agree that the description of the statistical testing was insufficiently detailed. In the revised manuscript we will explicitly state that we applied a paired Wilcoxon signed-rank test to the per-fold balanced accuracy scores, report the exact p-values together with effect sizes (rank-biserial correlation), and add a permutation test that respects the repeated CV structure by randomly shuffling labels within each training partition while preserving the fold structure. These additions will allow readers to fully evaluate the claim that ASR quality contributes more to performance variation than classifier choice. revision: yes
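The testing the authors commit to can be sketched as follows. The per-fold scores here are simulated placeholders, not the paper's numbers, and the rank-biserial formula from the one-sided signed-rank statistic is one common convention:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold balanced accuracies for the 25 folds of 5x5 CV;
# in the real pipeline these come from the base- and small-transcript runs.
rng = np.random.default_rng(0)
base = rng.normal(0.74, 0.03, 25)
small = base + rng.normal(0.04, 0.02, 25)  # simulated small-model gain

# One-sided paired Wilcoxon signed-rank test on the fold-level scores.
res = wilcoxon(small, base, alternative="greater")
w_plus = res.statistic           # sum of ranks of positive differences
total = 25 * 26 / 2              # total rank sum for n = 25 pairs
r_rb = 2 * w_plus / total - 1    # rank-biserial effect size in [-1, 1]
print(f"p={res.pvalue:.2e}, rank-biserial r={r_rb:.3f}")
```

Note that this test still treats the 25 fold scores as exchangeable pairs; the permutation scheme the authors describe is the additional step needed to respect the dependence the referee raises.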
- Referee: [Results] Table or figure reporting balanced accuracy >0.7850 for Linear SVM on Whisper-small transcripts provides no per-fold standard deviations, confidence intervals, or variance across the repeated CV runs, weakening assessment of result stability and the cross-ASR comparison.
Authors: We acknowledge the omission. The revised results section and associated table will report the mean balanced accuracy together with the standard deviation across the 25 CV folds and 95% confidence intervals obtained via bootstrap resampling of the fold-level scores. This will enable direct assessment of result stability and the reliability of the cross-ASR performance differences. revision: yes
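The bootstrap the authors propose can be sketched as a simple percentile bootstrap over fold-level scores. The values below are placeholders for the 25 Linear SVM scores on Whisper-small transcripts; note that a fold-level bootstrap, like the paired test, does not by itself correct for dependence between overlapping CV folds:

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder fold-level balanced accuracies (25 values for 5x5 CV).
fold_scores = rng.normal(0.785, 0.03, 25)

# Percentile bootstrap: resample folds with replacement, 10k replicates.
boot_means = np.array([
    rng.choice(fold_scores, size=fold_scores.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={fold_scores.mean():.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```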
Circularity Check
No circularity: empirical benchmark on external public dataset
Full rationale
The paper reports direct empirical comparisons of standard ML pipelines (TF-IDF lexical features, Linear SVM / Logistic Regression) on the public ADReSSo 2021 dataset under repeated 5x5 stratified CV. Performance differences between Whisper-small and Whisper-base transcripts are measured via paired statistical tests on held-out folds. No equations derive a target quantity from a fitted parameter that is then re-used as a prediction, no self-citation supplies a uniqueness theorem or ansatz that bears the central claim, and no known result is merely renamed. The derivation chain consists entirely of reproducible data processing and off-the-shelf classifiers evaluated on external labels, so the reported balanced-accuracy gains and significance statements do not rest on any quantity the paper itself assumes or fits.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption ADReSSo 2021 dataset labels accurately reflect clinical Alzheimer's status
- domain assumption Lexical TF-IDF features are sufficient to capture disease-related language differences
Reference graph
Works this paper leans on
- [1] K. C. Fraser, J. A. Meltzer, F. Rudzicz, Linguistic features identify Alzheimer's disease in narrative speech, Journal of Alzheimer's Disease 49 (2) (2016) 407–422. doi:10.3233/JAD-150520
- [2] S. de la Fuente Garcia, S. Luz, Evaluating the effect of linguistic and acoustic features on Alzheimer's dementia recognition, Frontiers in Aging Neuroscience 10 (2018) 207. doi:10.3389/fnagi.2018.00207
- [3] F. Haider, S. de la Fuente, S. Luz, An investigation of acoustic and linguistic features for Alzheimer's dementia detection, Frontiers in Computer Science 2 (2020) 624659. doi:10.3389/fcomp.2020.624659
- [4] S. Luz, F. Haider, S. de la Fuente Garcia, Editorial: Alzheimer's dementia recognition through spontaneous speech, Frontiers in Computer Science 3 (2021) 780169. doi:10.3389/fcomp.2021.780169
- [5] Z. S. Syed, M. Rashid, R. Naqvi, S. Ehsan, X. Wang, M. ur Rehman, R. Naguib, K. McDonald-Maier, Tackling the ADReSSo challenge 2021: The MUET-RMIT system for Alzheimer's dementia recognition from spontaneous speech, in: Proc. Interspeech 2021, 2021, pp. 3810–3814. doi:10.21437/Interspeech.2021-1761
- [6] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: Proceedings of the 40th International Conference on Machine Learning, 2023, pp. 28492–28518
- [7] S. O. Orimaye, J. S.-M. Wong, K. J. Golden, Predicting probable Alzheimer's disease using linguistic deficits and biomarkers, BMC Bioinformatics 15 (2014) 34. doi:10.1186/1471-2105-15-34
- [8] W. Jarrold, B. Peintner, E. Yeh, R. Krasnow, H. Javitz, G. E. Swan, Disfluencies and discourse perseveration in Alzheimer's disease, International Journal of Language & Communication Disorders 49 (6) (2014) 617–628. doi:10.1111/1460-6984.12095
- [9] A. Satt, S. Rozenberg, R. Hoory, Efficient emotion recognition from speech using deep learning on spectrograms, in: Proc. Interspeech 2017, 2017, pp. 1089–1093
- [10] S. V. Pakhomov, et al., Computerized analysis of speech and language to identify psycholinguistic correlates of frontotemporal lobar degeneration, Cognitive and Behavioral Neurology 23 (3) (2010) 165–177
- [12] S. Luz, F. Haider, S. de la Fuente, D. Fromm, B. MacWhinney, Alzheimer's dementia recognition through spontaneous speech: The ADReSS challenge, in: Proc. Interspeech 2020, 2020, pp. 2172–2176. doi:10.21437/Interspeech.2020-2571
- [13] B. MacWhinney, The TalkBank system for research on spoken communication, in: The Oxford Handbook of Psycholinguistics, Oxford University Press, 2011
- [14] A. M. Lanzi, A. K. Saylor, D. Fromm, H. Liu, B. MacWhinney, M. L. Cohen, DementiaBank: Theoretical rationale, protocol, and illustrative analyses, American Journal of Speech-Language Pathology 32 (2) (2023) 426–438. doi:10.1044/2022_AJSLP-22-00281
- [15] S. de la Fuente Garcia, S. Luz, Evaluation of computational features for automatic prediction of mild cognitive impairment from speech, Frontiers in Aging Neuroscience 12 (2020) 593215. doi:10.3389/fnagi.2020.593215
- [16] J. Chen, Z. Ke, Q. Zhu, Y. Wang, et al., Automatic detection of Alzheimer's disease using spontaneous speech only, Frontiers in Aging Neuroscience 14 (2022) 843456
- [17] X. Qi, H. Zhang, et al., Noninvasive automatic detection of Alzheimer's disease from spontaneous speech with prompt-based learning, Frontiers in Aging Neuroscience 15 (2023) 1172960
- [18] A. Balagopalan, et al., Comparison of speech technologies for Alzheimer's disease detection, Computer Speech & Language (2020)
- [19] J. Ramos, Using TF-IDF to determine word relevance in document queries, Proceedings of the First Instructional Conference on Machine Learning (2003)
- [20] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297. doi:10.1007/BF00994018
- [22] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd Edition, Springer, 2009
- [23] J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (1) (2010) 1–22
- [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008
- [25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, NAACL-HLT (2019) 4171–4186
- [27] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460
- [28] W. Kong, et al., Self-supervised speech representation learning for neurodegenerative disorder detection: A review, IEEE Journal of Biomedical and Health Informatics (2022)
- [31] Y. Wu, et al., A review of automatic speech and language processing for Alzheimer's disease and mild cognitive impairment, IEEE Reviews in Biomedical Engineering (2020)
- [32] A. B. R. Shatte, D. M. Hutchinson, S. J. Teague, Machine learning in mental health and related disorders: A systematic review, Journal of Medical Internet Research 21 (5) (2019) e15768
- [33] A. L. Beam, I. S. Kohane, Big data and machine learning in health care, JAMA 319 (13) (2018) 1317–1318
- [34] A. Rajkomar, J. Dean, I. Kohane, Machine learning in medicine, New England Journal of Medicine 380 (14) (2019) 1347–1358
- [35] E. J. Topol, High-performance medicine: the convergence of human and artificial intelligence, Nature Medicine 25 (1) (2019) 44–56
- [36] Y. Zhou, et al., Interpretable machine learning for healthcare, Nature Biomedical Engineering (2021)
- [37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830