Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction
Pith reviewed 2026-05-20 19:09 UTC · model grok-4.3
The pith
Similarity-guided selection of GPT-5 monologues improves cognitive score prediction from speech by balancing classes and cutting errors for low-score cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Similarity-guided class-balanced selection of GPT-5-generated oral-like monologues, using written responses as semantic anchors, yields more consistent improvements and substantially reduces prediction error for minority low-score participants while maintaining performance for the majority group.
What carries the argument
Similarity-guided class-balanced selection that prioritizes GPT-5 synthetic samples whose Sentence-BERT embeddings are closest to real spontaneous speech embeddings.
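The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout, the `select_balanced` name, the `target_per_class` parameter, and the rule of ranking each synthetic sample against the real embedding of its own source participant are all assumptions made for the example. In practice the embeddings would come from Sentence-BERT; here they are plain lists of floats.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_balanced(real, synthetic, target_per_class):
    """Pick synthetic samples per class, most similar first (hypothetical helper).

    real: list of (participant_id, score_class, embedding)
    synthetic: list of (participant_id, embedding)
    Returns a list of (participant_id, score_class, similarity).
    """
    real_by_id = {pid: (cls, emb) for pid, cls, emb in real}
    class_counts = defaultdict(int)
    for _, cls, _ in real:
        class_counts[cls] += 1

    # Score every synthetic sample against its source participant's real speech.
    candidates = defaultdict(list)
    for pid, emb in synthetic:
        cls, real_emb = real_by_id[pid]
        candidates[cls].append((cosine(emb, real_emb), pid))

    selected = []
    for cls, items in candidates.items():
        # Only top up classes that are below the target size.
        need = max(0, target_per_class - class_counts[cls])
        items.sort(key=lambda t: t[0], reverse=True)
        selected.extend((pid, cls, sim) for sim, pid in items[:need])
    return selected

# Toy vectors, invented for illustration only.
real = [("p1", "low", [1.0, 0.0]), ("p2", "high", [0.0, 1.0]), ("p3", "high", [0.0, 1.0])]
synthetic = [("p1", [0.9, 0.1]), ("p1", [0.1, 0.9]), ("p2", [0.0, 1.0])]
picked = select_balanced(real, synthetic, target_per_class=2)
```

With these toy inputs only the under-represented "low" class receives a synthetic sample, and it is the one closest to that participant's real embedding.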
Load-bearing premise
GPT-5 rewrites of written responses produce synthetic monologues whose Sentence-BERT embeddings carry the same cognitive-status signal as real spontaneous speech, especially for low-score participants.
What would settle it
A controlled experiment in which adding the similarity-selected synthetic samples fails to lower, or even raises, mean prediction error on held-out low-score cases relative to the unaugmented baseline.
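A rough sketch of how that comparison could be scored, assuming held-out predictions are available for both the unaugmented baseline and the augmented model. The cutoff of 20 for "low-score" is an illustrative assumption, not a value taken from the paper, and all numbers are invented toy data.

```python
def subgroup_mae(y_true, y_pred, low_cutoff):
    """Return (MAE on scores <= low_cutoff, MAE on scores > low_cutoff)."""
    low, rest = [], []
    for t, p in zip(y_true, y_pred):
        (low if t <= low_cutoff else rest).append(abs(t - p))

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(low), mean(rest)

# Invented toy scores and predictions, for illustration only.
y_true = [12, 18, 25, 28, 30]
pred_baseline = [20, 22, 26, 27, 29]
pred_augmented = [15, 20, 26, 27, 28]
LOW_CUTOFF = 20  # assumption: hypothetical threshold for the low-score subgroup

base_low, base_rest = subgroup_mae(y_true, pred_baseline, LOW_CUTOFF)
aug_low, aug_rest = subgroup_mae(y_true, pred_augmented, LOW_CUTOFF)

# The claim would be undermined if augmentation failed to lower low-score error.
claim_survives = aug_low < base_low
```

Reporting the two subgroups separately also shows whether any gain for low-score cases comes at the expense of the majority group.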
Original abstract
Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an LLM-driven data augmentation framework for predicting cognitive scores from spontaneous speech. Written responses to clinical prompts serve as semantic anchors for GPT-5 to generate multiple oral-like monologues in varying styles. These synthetic samples are embedded with Sentence-BERT and used to train a Partial Least Squares regression model on Hasegawa Dementia Scale scores from a Japanese corpus. Two augmentation strategies are compared: random class-balanced selection (moderate but unstable gains) and similarity-guided class-balanced selection (more consistent gains that substantially reduce error for minority low-score participants while preserving majority-group performance).
Significance. If the quantitative results hold, the work would be significant for clinical NLP by offering a scalable way to mitigate small sample sizes and class imbalance in cognitive assessment from speech. The similarity-guided selection mechanism is a principled contribution that could generalize to other imbalanced clinical prediction tasks, improving data efficiency without new patient recruitment.
major comments (2)
- [Methods (augmentation pipeline and embedding step)] The central claim that similarity-guided augmentation substantially reduces prediction error for low-score minority participants depends on the untested premise that Sentence-BERT embeddings of GPT-5-generated monologues preserve the same cognitive-status variance as real spontaneous speech. The pipeline starts from written anchors (lacking disfluencies and prosody) and applies style transfer; no distributional comparison, embedding-space analysis, or ablation on real vs. synthetic low-score samples is described to confirm that impairment-related signal is retained rather than prompt or LLM artifacts. This assumption is load-bearing for attributing error reduction to added signal.
- [Results and Experiments] No quantitative metrics, error bars, dataset sizes, number of generated samples per participant, or statistical tests are reported to support the claims of 'more consistent improvements' and 'substantially reducing prediction error.' The abstract and results description remain qualitative, preventing assessment of effect size, reproducibility, or whether the minority-class gains exceed what would be expected from simple sample-size increase.
minor comments (1)
- [Methods] Clarify the exact GPT model version used and whether any post-processing (e.g., length normalization or disfluency insertion) was applied to the generated monologues.
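A first step toward the embedding-space analysis requested in the major comment on the augmentation pipeline could look like the sketch below. It compares how similar each synthetic monologue is to its own participant's real speech against how similar it is to other participants' real speech. The function name and data layout are hypothetical, and a higher matched similarity would only show that participant-specific content survives generation; it would not by itself show that impairment-related signal is retained.

```python
import math
from statistics import mean

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def matched_vs_mismatched(real, synthetic):
    """Mean similarity of synthetic samples to their own vs. other participants.

    real: dict mapping participant_id -> embedding of real speech
    synthetic: list of (participant_id, embedding)
    Assumes at least two participants, so both groups are non-empty.
    """
    matched, mismatched = [], []
    for pid, emb in synthetic:
        for rid, real_emb in real.items():
            (matched if rid == pid else mismatched).append(cosine(emb, real_emb))
    return mean(matched), mean(mismatched)

# Invented toy vectors, for illustration only.
real = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
synthetic = [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]
own, other = matched_vs_mismatched(real, synthetic)
```

Running the same comparison restricted to low-score participants would speak more directly to the referee's concern.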
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which highlight important aspects of methodological validation and quantitative reporting. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Methods (augmentation pipeline and embedding step)] The central claim that similarity-guided augmentation substantially reduces prediction error for low-score minority participants depends on the untested premise that Sentence-BERT embeddings of GPT-5-generated monologues preserve the same cognitive-status variance as real spontaneous speech. The pipeline starts from written anchors (lacking disfluencies and prosody) and applies style transfer; no distributional comparison, embedding-space analysis, or ablation on real vs. synthetic low-score samples is described to confirm that impairment-related signal is retained rather than prompt or LLM artifacts. This assumption is load-bearing for attributing error reduction to added signal.
Authors: We agree that direct validation of whether the synthetic embeddings retain impairment-related variance is necessary to fully support the attribution of gains to added signal rather than artifacts. The original manuscript relies on the downstream performance improvements under similarity-guided selection as indirect evidence, but does not include explicit checks. In the revised version we will add (i) a comparison of embedding distributions (e.g., cosine similarity histograms and PCA visualizations) between real spontaneous speech and the GPT-5-generated monologues for low-score participants, and (ii) an ablation that trains the PLS model with and without the synthetic low-score samples to quantify their specific contribution.
revision: yes
- Referee: [Results and Experiments] No quantitative metrics, error bars, dataset sizes, number of generated samples per participant, or statistical tests are reported to support the claims of 'more consistent improvements' and 'substantially reducing prediction error.' The abstract and results description remain qualitative, preventing assessment of effect size, reproducibility, or whether the minority-class gains exceed what would be expected from simple sample-size increase.
Authors: We acknowledge that the current presentation is primarily qualitative and that this limits evaluation of effect sizes and statistical reliability. The revised manuscript will report concrete numbers: dataset sizes (number of real participants and total samples after augmentation), the exact number of synthetic monologues generated per written anchor, mean absolute error (MAE) and root-mean-square error (RMSE) with standard deviations across cross-validation folds, and statistical comparisons (e.g., paired Wilcoxon tests) between the random and similarity-guided strategies, with separate reporting for the low-score minority subgroup versus the majority group.
revision: yes
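The paired Wilcoxon comparison mentioned in the second response could be computed along the following lines. This is a self-contained sketch of an exact two-sided signed-rank test, with the function name and the toy error values being assumptions for illustration; an established statistics library would normally be preferable. Zero differences are dropped and tied absolute differences receive average ranks, with ties detected by exact float equality.

```python
from collections import Counter

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Intended for small samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank |differences|; doubled ranks stay integers under average-rank ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * average of ranks i+1 .. j+1
        i = j + 1

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)

    # Exact null distribution of the doubled W+ over all 2^n sign assignments.
    counts = Counter({0: 1})
    for r in ranks2:
        nxt = Counter()
        for s, c in counts.items():
            nxt[s] += c
            nxt[s + r] += c
        counts = nxt

    # Two-sided: outcomes at least as far from the null mean as the observed one.
    dev = abs(2 * w_plus2 - total2)
    extreme = sum(c for s, c in counts.items() if abs(2 * s - total2) >= dev)
    return w_plus2 / 2, extreme / 2 ** n

# Invented per-fold absolute errors, for illustration only.
errors_random = [4.0, 3.5, 5.0, 2.0, 6.0]
errors_guided = [3.0, 1.5, 2.0, 1.0, 1.0]
w_plus, p_value = wilcoxon_signed_rank(errors_random, errors_guided)
```

With only five pairs the smallest attainable two-sided p-value is 0.0625, which illustrates why fold counts and subgroup sizes need to be reported alongside any test result.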
Circularity Check
No significant circularity in empirical augmentation and prediction pipeline
full rationale
The paper describes an empirical ML pipeline: written responses serve as semantic anchors for GPT-5 generation of oral-like monologues, followed by similarity-guided class-balanced selection, Sentence-BERT embeddings, and Partial Least Squares regression to predict Hasegawa Dementia Scale scores. No equations, derivations, or first-principles results are presented that reduce the claimed improvements to inputs by construction. The reported gains for low-score participants arise from experimental comparisons on a held-out Japanese corpus rather than any self-definitional fitting, renamed known result, or load-bearing self-citation chain. All components (GPT-5, Sentence-BERT, PLS) are external standard tools, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sentence-BERT embeddings of synthetic oral-like texts carry the same cognitive-status information as embeddings of real spontaneous speech.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5... similarity-guided class-balanced selection... Partial Least Squares regression model trained on Sentence-BERT speech embeddings."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "similarity-guided class-balanced selection prioritizes semantically close synthetic samples"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction The global increase in life expectancy has made dementia one of the most pressing public health challenges of the 21st century, with cases expected to triple by 2050 (Livingston et al., 2020; World Health Organization, 2019). In the absence of curative treatments, early detection of cognitive decline is critical for enabling timely interventions. Al...
- [2] Related Work Automatic assessment of cognitive decline from spontaneous speech has attracted growing interest in both computational linguistics and clinical AI. Early studies primarily focus on binary classification of dementia, often relying on publicly available speech corpora such as DementiaBank (Fraser et al., 2016), which also inspired standard-...
- [3] Please tell us about the last good thing that happened to you
Task Formulation and Dataset 3.1. Task Formulation The objective of this study is to predict cognitive scores from speech data using a regression approach, with the ultimate goal of supporting early detection of cognitive decline. In our setting, each participant provides a spontaneous narrative in response to a single standardized cognitive prompt, a...
- [4] Proposed Method 4.1. Overview As illustrated in Figures 1 and 3, we propose a purely natural language processing (NLP)-based framework to augment speech data for HDS score regression. For each patient, multiple synthetic oral-style monologues are generated using a large language model, conditioned on the patient’s written narrative and associated HDS sc...
- [5] Experimental Setup 5.1. Data Generation Details The dataset is constructed from 30 original oral transcriptions, each corresponding to a distinct patient. For each patient, seven synthetic oral-style monologues are generated using the procedure described in Section 4, resulting in a total of 210 synthetic samples. The complete pool of available data the...
- [6] Results and Discussion 6.1. Effect of Synthetic Data Augmentation Figure 43 shows the evolution of RMSE and R² as a function of the number of synthetic samples generated per patient, for several data augmentation strategies. As a first observation, all augmentation methods improve performance compared to the baseline without augmentation, with lower RMSE an...
- [7] Limitations Despite the promising results of similarity-guided class-balanced augmentation, several limitations of the current approach should be acknowledged. First, the dataset exhibits severe class imbalances. Some HDS score classes, such as 23, 24, or 27, include only a single patient. Under the LOOCV evaluation scheme, holding out such individual...
- [8] Future Work While our current approach focuses on modifying the style of narratives while keeping their content fixed, an important next step would be to explore content modification while preserving the original cognitive style. Such experiments could help disentangle the contributions of narrative structure versus lexical style in predicting cognitive...
- [9] Conclusion In this work, we introduced a novel LLM-driven cross-modal-inspired data augmentation framework for cognitive score prediction from spontaneous speech. By leveraging written narratives as semantic anchors, our method generates synthetic oral-style monologues that preserve content while introducing stylistic variability, addressing the dua...
- [10] References Aparna Balagopalan, Ben Eyre, Frank Rudzicz, and Jeka Novikova. 2020. To BERT or not to BERT: Comparing speech and language-based approaches for Alzheimer’s disease detection. In Proceedings of Interspeech 2020, pages 2167–2171. Kathleen C. Fraser, Jed A. Meltzer, and Frank Rudzicz. 2016. Linguistic features identify Alzheimer’s disease in nar...
- [11] Data augmentation for dementia detection in spoken language. In Proceedings of INTERSPEECH 2022. Ildikó Hoffmann, Dezső Németh, Crystal D. Dye, Magdolna Pákáski, Tibor Irinyi, and János Kálmán. 2010. Temporal parameters of spontaneous speech in Alzheimer’s disease. International Journal of Speech-Language Pathology, 12(1):29–34. T. Igarashi and M. Nih...
- [12] Early dementia detection with speech analysis and machine learning techniques. Discover Sustainability, 5:65. Maria Rita Lima, Andrew Capstick, Fatemeh Geranmayeh, Reza Nilforooshan, Maja Mataric, Ravi Vaidyanathan, and Payam Barnaghi. 2025. Evaluating spoken language as a biomarker for automated screening of cognitive impairment. Communications Medicine, 6(...
- [13] Dementia prevention, intervention, and care: 2020 report of the Lancet Commission. The Lancet, 396(10248):413–446. S. Maeshima, A. Osawa, K. Kawamura, T. Yoshimura, E. Otaka, Y. Sato, I. Ueda, N. Itoh, I. Kondo, and H. Arai. 2024. Neuropsychological tests used for dementia assessment in Japan: Current status. Geriatrics & Gerontology International, 24(Suppl 1):...
- [14] You generate natural Japanese spoken monologues
https://openai.com/index/introducing-gpt-5/. Accessed: 2026-01-19. Xiaoyan Qi, Qiang Zhou, Jian Dong, and Wei Bao. 2023. Noninvasive automatic detection of Alzheimer’s disease from spontaneous speech: a review. Frontiers in Aging Neuroscience, 15:1224723. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-netwo...