Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

Eiji Aramaki; Lenard Paulo Tamayo; Shaowen Peng; Shohei Hisada; Shoko Wakamiya; Si-Belkacem Yamine Ketir

arxiv: 2605.16077 · v1 · pith:JETWKLJPnew · submitted 2026-05-15 · 💻 cs.CL

Si-Belkacem Yamine Ketir, Lenard Paulo Tamayo, Shohei Hisada, Shaowen Peng, Shoko Wakamiya, Eiji Aramaki

Pith reviewed 2026-05-20 19:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords data augmentation · LLM · cognitive assessment · speech analysis · class imbalance · Hasegawa Dementia Scale · Sentence-BERT · semantic similarity

The pith

Similarity-guided selection of GPT-5 monologues improves cognitive score prediction from speech by balancing classes and cutting errors for low-score cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that written responses can anchor GPT-5 generation of oral-style monologues, and that selecting the synthetic samples by semantic similarity to real speech produces better augmentation than random selection. This leads to more stable gains in Partial Least Squares regression on Sentence-BERT embeddings when predicting Hasegawa Dementia Scale scores from a Japanese corpus. A reader would care because clinical speech datasets are typically small and imbalanced, so this offers a way to improve accuracy for scarce low-score participants without needing more real recordings. The method keeps performance steady for the majority high-score group while lowering error on the minority group.

Core claim

Similarity-guided class-balanced selection of GPT-5-generated oral-like monologues, using written responses as semantic anchors, yields more consistent improvements and substantially reduces prediction error for minority low-score participants while maintaining performance for the majority group.

What carries the argument

Similarity-guided class-balanced selection that prioritizes GPT-5 synthetic samples whose Sentence-BERT embeddings are closest to real spontaneous speech embeddings.
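The selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ranking rule (cosine similarity between a synthetic sample and the real transcript of the same participant), the per-class target count, and the names `cosine` and `select_balanced` are all assumptions, and the two-dimensional vectors stand in for Sentence-BERT embeddings.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_balanced(real, synthetic, target_per_class):
    """Pick synthetic samples so each score class reaches target_per_class.

    real, synthetic: lists of (participant_id, score, embedding).
    Within a class, candidates closest to their own participant's real
    embedding are taken first (hypothetical ranking rule).
    """
    real_emb = {pid: emb for pid, _, emb in real}
    real_count = defaultdict(int)
    for _, score, _ in real:
        real_count[score] += 1

    # Score every synthetic candidate against its source participant.
    by_class = defaultdict(list)
    for pid, score, emb in synthetic:
        by_class[score].append((cosine(emb, real_emb[pid]), pid, emb))

    selected = []
    for score, candidates in by_class.items():
        needed = max(0, target_per_class - real_count[score])
        candidates.sort(key=lambda c: c[0], reverse=True)
        selected.extend((pid, score, emb) for _, pid, emb in candidates[:needed])
    return selected

# Toy data: class 30 already has two real samples, class 24 has one.
real = [("p1", 30, [1.0, 0.0]), ("p2", 30, [0.0, 1.0]), ("p3", 24, [1.0, 1.0])]
synthetic = [
    ("p3", 24, [1.0, 0.9]),   # close to p3's real speech
    ("p3", 24, [1.0, -1.0]),  # orthogonal to p3's real speech
    ("p1", 30, [1.0, 0.1]),   # class 30 needs no extra samples
]
picked = select_balanced(real, synthetic, target_per_class=2)
print(picked)
```

Replacing the sort with a random shuffle would give the random class-balanced baseline that the paper compares against, under the same assumptions.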

Load-bearing premise

GPT-5 rewrites of written responses produce synthetic monologues whose Sentence-BERT embeddings carry the same cognitive-status signal as real spontaneous speech, especially for low-score participants.

What would settle it

A controlled experiment in which adding the similarity-selected synthetic samples fails to lower, or even raises, mean prediction error on held-out low-score cases relative to the unaugmented baseline.
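A check of this kind could be scored as in the sketch below, which computes mean absolute error separately for the minority (HDS 22–27) and majority (HDS 28–30) groups named in the Figure 5 text. The function name `group_mae` and all score values are made up for illustration; none of the numbers are results from the paper.

```python
def group_mae(y_true, y_pred, threshold=27):
    """Return (minority_mae, majority_mae), splitting on true score <= threshold."""
    minority, majority = [], []
    for t, p in zip(y_true, y_pred):
        (minority if t <= threshold else majority).append(abs(t - p))

    def mean(errors):
        return sum(errors) / len(errors) if errors else float("nan")

    return mean(minority), mean(majority)

# Illustrative values only (not taken from the paper).
y_true = [23, 26, 29, 30]
baseline_pred = [27, 28, 29, 29]    # hypothetical unaugmented model
augmented_pred = [24, 27, 29, 29]   # hypothetical augmented model

base_min, base_maj = group_mae(y_true, baseline_pred)
aug_min, aug_maj = group_mae(y_true, augmented_pred)
print(base_min, base_maj)  # 3.0 0.5
print(aug_min, aug_maj)    # 1.0 0.5
```

The claim would be undermined if, on held-out low-score cases, the augmented minority MAE were not lower than the baseline minority MAE.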

Figures

Figures reproduced from arXiv: 2605.16077 by Eiji Aramaki, Lenard Paulo Tamayo, Shaowen Peng, Shohei Hisada, Shoko Wakamiya, Si-Belkacem Yamine Ketir.

**Figure 1.** Overview of the proposed LLM-driven data augmentation framework for cognitive score prediction from speech. Underlined terms indicate oral markers, and terms in red indicate stylistic features. To overcome these limitations, spontaneous speech analysis has emerged as a non-invasive and cost-effective biomarker for cognitive health (Lima et al., 2025). Language, as a complex cognitive task integrating mem…

**Figure 2.** Distribution of HDS score classes before…

**Figure 3.** LLM-based data augmentation framework for predicting cognitive scores from speech. Written narratives are generally more structured than the oral responses and may contain additional details, as they are produced without time pressure and allow for greater reflection. To generate high-quality synthetic data, we use GPT-5, which provides the best performance among currently available LLMs for text generati…

**Figure 4.** Evolution of RMSE and R² as a function of the number of synthetic samples generated per patient, for several data augmentation strategies. As a first observation, all augmentation methods improve performance compared to the baseline without augmentation, with lower RMSE and higher R². This confirms the relevance of synthetic data in low-data settings. Building on this, LLM-driven approaches co…

**Figure 6.** Distribution of selected linguistic styles

**Figure 5.** True vs. predicted HDS scores for the best-performing similarity-based model. …fied evaluation by calculating the MAE separately for the minority (HDS 22–27) and majority (HDS 28–30) groups, as summarized in…

Original abstract

Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.
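The regression step can be illustrated with a small single-response PLS (PLS1) fit using NIPALS-style deflation. This is a sketch under stated assumptions: the paper's number of latent components is not given here, the function names `fit_pls1` and `predict_pls1` are hypothetical, and the two-dimensional toy rows stand in for Sentence-BERT embeddings of transcripts.

```python
def fit_pls1(X, y, n_components):
    """Fit PLS1 (one response variable) with NIPALS-style deflation.

    X: list of feature rows, y: list of targets.
    Returns the means and per-component (w, p, q) needed for prediction.
    """
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]

    components = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the residual y.
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(Xc[i][j] * w[j] for j in range(d)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(d)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # Deflate X and y by the part explained by this component.
        Xc = [[Xc[i][j] - t[i] * p[j] for j in range(d)] for i in range(n)]
        yc = [yc[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components

def predict_pls1(model, x):
    """Predict the response for one feature row."""
    x_mean, y_mean, components = model
    xc = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = sum(a * b for a, b in zip(xc, w))
        y_hat += q * t
        xc = [a - t * b for a, b in zip(xc, p)]
    return y_hat

# Toy data generated from y = 2*x0 - x1 + 5 (not real embeddings or scores).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [5.0, 7.0, 4.0, 6.0, 8.0]
model = fit_pls1(X, y, n_components=2)
prediction = predict_pls1(model, [3.0, 2.0])
print(round(prediction, 6))  # 9.0
```

With as many components as the rank of the centered features, PLS1 coincides with ordinary least squares, which is why the toy example is recovered exactly. The point of PLS for high-dimensional embeddings and few participants is to use far fewer components than features; in practice a library implementation such as scikit-learn's `PLSRegression` would be the usual choice.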

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an LLM-driven data augmentation framework for predicting cognitive scores from spontaneous speech. Written responses to clinical prompts serve as semantic anchors for GPT-5 to generate multiple oral-like monologues in varying styles. These synthetic samples are embedded with Sentence-BERT and used to train a Partial Least Squares regression model on Hasegawa Dementia Scale scores from a Japanese corpus. Two augmentation strategies are compared: random class-balanced selection (moderate but unstable gains) and similarity-guided class-balanced selection (more consistent gains that substantially reduce error for minority low-score participants while preserving majority-group performance).

Significance. If the quantitative results hold, the work would be significant for clinical NLP by offering a scalable way to mitigate small sample sizes and class imbalance in cognitive assessment from speech. The similarity-guided selection mechanism is a principled contribution that could generalize to other imbalanced clinical prediction tasks, improving data efficiency without new patient recruitment.

major comments (2)

[Methods (augmentation pipeline and embedding step)] The central claim that similarity-guided augmentation substantially reduces prediction error for low-score minority participants depends on the untested premise that Sentence-BERT embeddings of GPT-5-generated monologues preserve the same cognitive-status variance as real spontaneous speech. The pipeline starts from written anchors (lacking disfluencies and prosody) and applies style transfer; no distributional comparison, embedding-space analysis, or ablation on real vs. synthetic low-score samples is described to confirm that impairment-related signal is retained rather than prompt or LLM artifacts. This assumption is load-bearing for attributing error reduction to added signal.
[Results and Experiments] No quantitative metrics, error bars, dataset sizes, number of generated samples per participant, or statistical tests are reported to support the claims of 'more consistent improvements' and 'substantially reducing prediction error.' The abstract and results description remain qualitative, preventing assessment of effect size, reproducibility, or whether the minority-class gains exceed what would be expected from simple sample-size increase.

minor comments (1)

[Methods] Clarify the exact GPT model version used and whether any post-processing (e.g., length normalization or disfluency insertion) was applied to the generated monologues.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which highlight important aspects of methodological validation and quantitative reporting. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [Methods (augmentation pipeline and embedding step)] The central claim that similarity-guided augmentation substantially reduces prediction error for low-score minority participants depends on the untested premise that Sentence-BERT embeddings of GPT-5-generated monologues preserve the same cognitive-status variance as real spontaneous speech. The pipeline starts from written anchors (lacking disfluencies and prosody) and applies style transfer; no distributional comparison, embedding-space analysis, or ablation on real vs. synthetic low-score samples is described to confirm that impairment-related signal is retained rather than prompt or LLM artifacts. This assumption is load-bearing for attributing error reduction to added signal.

Authors: We agree that direct validation of whether the synthetic embeddings retain impairment-related variance is necessary to fully support the attribution of gains to added signal rather than artifacts. The original manuscript relies on the downstream performance improvements under similarity-guided selection as indirect evidence, but does not include explicit checks. In the revised version we will add (i) a comparison of embedding distributions (e.g., cosine similarity histograms and PCA visualizations) between real spontaneous speech and the GPT-5-generated monologues for low-score participants, and (ii) an ablation that trains the PLS model with and without the synthetic low-score samples to quantify their specific contribution.

revision: yes

Referee: [Results and Experiments] No quantitative metrics, error bars, dataset sizes, number of generated samples per participant, or statistical tests are reported to support the claims of 'more consistent improvements' and 'substantially reducing prediction error.' The abstract and results description remain qualitative, preventing assessment of effect size, reproducibility, or whether the minority-class gains exceed what would be expected from simple sample-size increase.

Authors: We acknowledge that the current presentation is primarily qualitative and that this limits evaluation of effect sizes and statistical reliability. The revised manuscript will report concrete numbers: dataset sizes (number of real participants and total samples after augmentation), the exact number of synthetic monologues generated per written anchor, mean absolute error (MAE) and root-mean-square error (RMSE) with standard deviations across cross-validation folds, and statistical comparisons (e.g., paired Wilcoxon tests) between the random and similarity-guided strategies, with separate reporting for the low-score minority subgroup versus the majority group.

revision: yes
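The paired comparison mentioned in the response could look like the sketch below: an exact two-sided Wilcoxon signed-rank test on per-fold errors from the two strategies. The function name `wilcoxon_signed_rank` and the error values are illustrative assumptions, not data from the paper. Exact enumeration is only feasible for a handful of pairs; a larger number of folds would call for a normal approximation or a library routine.

```python
from itertools import product

def wilcoxon_signed_rank(a, b):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks. All 2**n sign patterns
    are enumerated, so this is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)

    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Illustrative per-fold absolute errors (made up, not from the paper).
random_errors = [3.0, 2.5, 4.0, 1.5, 2.0, 3.5]
similarity_errors = [2.0, 2.0, 2.5, 1.0, 2.5, 1.5]
w_plus, p_value = wilcoxon_signed_rank(random_errors, similarity_errors)
print(w_plus, p_value)  # 19.0 0.125
```

In this toy example five of six folds favor the similarity-guided strategy, yet the exact p-value is 0.125, which shows how little power such a test has with very few pairs.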

Circularity Check

0 steps flagged

No significant circularity in empirical augmentation and prediction pipeline

The paper describes an empirical ML pipeline: written responses serve as semantic anchors for GPT-5 generation of oral-like monologues, followed by similarity-guided class-balanced selection, Sentence-BERT embeddings, and Partial Least Squares regression to predict Hasegawa Dementia Scale scores. No equations, derivations, or first-principles results are presented that reduce the claimed improvements to inputs by construction. The reported gains for low-score participants arise from experimental comparisons on held-out participants from a Japanese corpus rather than any self-definitional fitting, renamed known result, or load-bearing self-citation chain. All components (GPT-5, Sentence-BERT, PLS) are external standard tools, so the evaluation does not depend on quantities defined by the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the untested premise that LLM-generated texts preserve the cognitive-relevant semantic features present in real spontaneous speech; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Sentence-BERT embeddings of synthetic oral-like texts carry the same cognitive-status information as embeddings of real spontaneous speech.
Invoked when written responses are treated as semantic anchors for generating samples used in score prediction.

pith-pipeline@v0.9.0 · 5751 in / 1384 out tokens · 46647 ms · 2026-05-20T19:09:17.883793+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5... similarity-guided class-balanced selection... Partial Least Squares regression model trained on Sentence-BERT speech embeddings."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "similarity-guided class-balanced selection prioritizes semantically close synthetic samples"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

Introduction The global increase in life expectancy has made dementia one of the most pressing public health challenges of the 21st century, with cases expected to triple by 2050 (Livingston et al., 2020; World Health Organization, 2019). In the absence of curative treatments, early detection of cognitive decline is critical for enabling timely interventions. Al...

work page internal anchor Pith review Pith/arXiv arXiv 2050
[2]

Related Work Automatic assessment of cognitive decline from spontaneous speech has attracted growing interest in both computational linguistics and clinical AI. Early studies primarily focus on binary classification of dementia, often relying on publicly available speech corpora such as DementiaBank (Fraser et al., 2016), which also inspired standard-...

work page 2016
[3]

Please tell us about the last good thing that happened to you

Task Formulation and Dataset 3.1. Task Formulation The objective of this study is to predict cognitive scores from speech data using a regression approach, with the ultimate goal of supporting early detection of cognitive decline. In our setting, each participant provides a spontaneous narrative in response to a single standardized cognitive prompt, a...

work page
[4]

original

Proposed Method 4.1. Overview As illustrated in Figures 1 and 3, we propose a purely natural language processing (NLP)-based framework to augment speech data for HDS score regression. For each patient, multiple synthetic oral-style monologues are generated using a large language model, conditioned on the patient’s written narrative and associated HDS sc...

work page 2025
[5]

Data Generation Details The dataset is constructed from 30 original oral transcriptions, each corresponding to a distinct patient

Experimental Setup 5.1. Data Generation Details The dataset is constructed from 30 original oral transcriptions, each corresponding to a distinct patient. For each patient, seven synthetic oral-style monologues are generated using the procedure described in Section 4, resulting in a total of 210 synthetic samples. The complete pool of available data the...

work page 2019
[6]

Results and Discussion 6.1. Effect of Synthetic Data Augmentation Figure 4 shows the evolution of RMSE and R² as a function of the number of synthetic samples generated per patient, for several data augmentation strategies. As a first observation, all augmentation methods improve performance compared to the baseline without augmentation, with lower RMSE an...

work page
[7]

First, the dataset exhibits severe class imbalances

Limitations Despite the promising results of similarity-guided class-balanced augmentation, several limitations of the current approach should be acknowledged. First, the dataset exhibits severe class imbalances. Some HDS score classes, such as 23, 24, or 27, include only a single patient. Under the LOOCV evaluation scheme, holding out such individual...

work page
[8]

Future Work While our current approach focuses on modifying the style of narratives while keeping their content fixed, an important next step would be to explore content modification while preserving the original cognitive style. Such experiments could help disentangle the contributions of narrative structure versus lexical style in predicting cognitive...

work page
[9]

Conclusion In this work, we introduced a novel LLM-driven cross-modal-inspired data augmentation framework for cognitive score prediction from spontaneous speech. By leveraging written narratives as semantic anchors, our method generates synthetic oral-style monologues that preserve content while introducing stylistic variability, addressing the dua...

work page
[10]

References Aparna Balagopalan, Ben Eyre, Frank Rudzicz, and Jeka Novikova. 2020. To BERT or not to BERT: Comparing speech and language-based approaches for Alzheimer’s disease detection. In Proceedings of Interspeech 2020, pages 2167–2171. Kathleen C. Fraser, Jed A. Meltzer, and Frank Rudzicz. 2016. Linguistic features identify Alzheimer’s disease in nar...

work page 2020
[11]

In Proceedings of INTERSPEECH 2022

Data augmentation for dementia detection in spoken language. In Proceedings of INTERSPEECH 2022. Ildikó Hoffmann, Dezső Németh, Crystal D. Dye, Magdolna Pákáski, Tibor Irinyi, and János Kálmán. 2010. Temporal parameters of spontaneous speech in Alzheimer’s disease. International Journal of Speech-Language Pathology, 12(1):29–34. T. Igarashi and M. Nih...

work page 2022
[12]

Maria Rita Lima, Andrew Capstick, Fatemeh Geranmayeh, Reza Nilforooshan, Maja Mataric, Ravi Vaidyanathan, and Payam Barnaghi

Early dementia detection with speech analysis and machine learning techniques. Discover Sustainability, 5:65. Maria Rita Lima, Andrew Capstick, Fatemeh Geranmayeh, Reza Nilforooshan, Maja Mataric, Ravi Vaidyanathan, and Payam Barnaghi. 2025. Evaluating spoken language as a biomarker for automated screening of cognitive impairment. Communications Medicine, 6(...

work page 2025
[13]

Dementia prevention, intervention, and care: 2020 report of the Lancet Commission. The Lancet, 396(10248):413–446. S. Maeshima, A. Osawa, K. Kawamura, T. Yoshimura, E. Otaka, Y. Sato, I. Ueda, N. Itoh, I. Kondo, and H. Arai. 2024. Neuropsychological tests used for dementia assessment in Japan: Current status. Geriatrics & Gerontology International, 24(Suppl 1):...

work page 2020
[14]

You generate natural Japanese spoken monologues

https://openai.com/index/introducing-gpt-5/. Accessed: 2026-01-19. Xiaoyan Qi, Qiang Zhou, Jian Dong, and Wei Bao. 2023. Noninvasive automatic detection of Alzheimer’s disease from spontaneous speech: a review. Frontiers in Aging Neuroscience, 15:1224723. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-netwo...

work page 2026
