arxiv: 2604.06758 · v1 · submitted 2026-04-08 · 💻 cs.CL

Multilingual Cognitive Impairment Detection in the Era of Foundation Models

Damar Hoogland , Boshko Koloski , Jaya Caporusso , Tine Kolenik , Ana Zwitter Vitez , Senja Pollak , Christina Manouilidou , Matthew Purver This is my paper

Pith reviewed 2026-05-10 18:17 UTC · model grok-4.3

classification 💻 cs.CL

keywords cognitive impairment detectionmultilingual speech analysiszero-shot LLMslinguistic featurestabular modelsspeech transcriptssmall data regimesfoundation models

0 comments

The pith

Supervised tabular models using linguistic features and embeddings outperform zero-shot LLMs for detecting cognitive impairment from speech transcripts in multiple languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares zero-shot large language models against supervised tabular models for classifying cognitive impairment from speech transcripts in English, Slovene, and Korean. Zero-shot LLMs serve as competitive baselines without training, but supervised models that combine engineered linguistic features with transcript embeddings generally achieve better performance. This holds under a leave-one-out protocol on small datasets. A sympathetic reader would care because cognitive impairment detection often faces limited labeled data, and reliable multilingual methods could improve accessibility of screening tools worldwide.

Core claim

Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable signals.

What carries the argument

Early and late fusion of engineered linguistic features with transcript embeddings in supervised tabular classifiers under leave-one-out cross-validation, compared against zero-shot LLM prompting on transcript-only, features-only, or combined inputs.

If this is right

Structured linguistic signals and simple fusion-based classifiers remain strong and reliable in small-data CI detection.
The value of adding limited supervision is language-dependent and tied to the availability of richer feature representations.
Combining modalities through early or late fusion yields more reliable results than relying on transcript embeddings or linguistic features alone.
Zero-shot LLMs can still serve as useful no-training baselines when labeled data is extremely scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid approaches may be required for other low-resource languages where foundation models have limited exposure to local linguistic patterns.
Clinical data collection for CI screening could prioritize linguistic feature annotations to maintain consistency across languages.
Scaling to larger datasets or bigger models might reduce but not eliminate the current advantage of engineered features.

Load-bearing premise

Engineered linguistic features remain robust, language-appropriate, and not overfitted to the small datasets used, with leave-one-out cross-validation sufficiently controlling for data scarcity without selection bias.

What would settle it

A larger independent test set where a zero-shot LLM achieves higher accuracy than the best supervised tabular model on any of the three languages would challenge the claim that supervised models generally perform better.

read the original abstract

We evaluate cognitive impairment (CI) classification from transcripts of speech in English, Slovene, and Korean. We compare zero-shot large language models (LLMs) used as direct classifiers under three input settings -- transcript-only, linguistic-features-only, and combined -- with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion of both modalities. Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable signals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Supervised models with linguistic features plus embeddings beat zero-shot LLMs for cognitive impairment detection in English, Slovene, and Korean, but LOOCV on small datasets leaves the edge uncertain.

read the letter

The paper's main result is that zero-shot LLMs work as no-training baselines for spotting cognitive impairment from speech transcripts across three languages, yet supervised tabular models that combine engineered linguistic features with embeddings usually do better. Few-shot experiments show the gains from extra labels depend on the language. That comparison, done with early and late fusion options, is the concrete piece here. It extends the usual English-only clinical NLP setups by including Slovene and Korean, and it keeps the focus on small-data conditions where structured features still matter. The abstract is clear that these signals remain reliable, which matches patterns seen in related health text work. The evaluation protocol is the soft spot. Leave-one-out cross-validation on small clinical datasets tends to produce high-variance scores and can introduce bias when features are tuned or selected within the same folds. Without reported per-fold spreads, significance tests between model classes, or an external hold-out, the claim that tabular models generally win is hard to weigh. Dataset sizes and exact feature engineering details would also help judge whether the linguistic signals are robust or partly overfit. This is the kind of incremental but practical comparison that researchers building multilingual health NLP tools would find useful. It is not a foundational shift, but the question is grounded and the setup is reproducible enough to warrant referee time. Send it for review and ask for the missing variance numbers and statistical checks.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates cognitive impairment (CI) classification from speech transcripts across English, Slovene, and Korean. It compares zero-shot LLMs used directly as classifiers (under transcript-only, features-only, and combined inputs) against supervised tabular models trained via leave-one-out cross-validation on engineered linguistic features, transcript embeddings, and early/late fusion variants. The central claim is that zero-shot LLMs offer competitive baselines but supervised tabular models generally perform better, especially with linguistic features included; few-shot experiments show language-dependent gains from limited supervision.

Significance. If the empirical comparisons hold after addressing robustness concerns, the work is significant for clinical NLP: it provides evidence that structured linguistic features remain valuable in small-data, multilingual CI detection settings even when foundation models are available, and it quantifies the limits of pure zero-shot approaches in this domain.

major comments (3)

[Methods] Methods section (LOOCV protocol): The leave-one-out cross-validation on small per-language clinical datasets is presented without reporting per-fold variance, standard deviations, or any statistical significance tests (e.g., McNemar or Wilcoxon tests) comparing zero-shot LLM performance against the supervised tabular models. This directly undermines the load-bearing claim that tabular models 'generally perform better' across languages, as LOOCV on limited samples is known to yield unstable estimates and potential selection bias.
[Results] Results section (model comparisons): The superiority of engineered linguistic features + embeddings over zero-shot LLM baselines is asserted, yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), confidence intervals, or effect sizes for the three languages in the main tables or text. Without these, the directional claim cannot be verified or sized.
[Methods] Feature engineering subsection: The linguistic features are described as 'engineered' and combined with embeddings, but no details are given on language-specific adaptations (e.g., POS taggers or parsers for Slovene/Korean), their validation against overfitting on the small datasets, or ablation showing they capture generalizable CI signals rather than dataset artifacts.

minor comments (2)

[Methods] The three input settings for zero-shot LLMs (transcript-only, linguistic-features-only, combined) would benefit from explicit prompt templates or examples in the text or appendix to allow reproducibility.
[Abstract] The abstract states directional results but omits any numerical performance values; including at least the key F1 or accuracy deltas would improve the summary's informativeness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of robustness and clarity that we will address in the revision. We respond to each major comment below.

read point-by-point responses

Referee: [Methods] Methods section (LOOCV protocol): The leave-one-out cross-validation on small per-language clinical datasets is presented without reporting per-fold variance, standard deviations, or any statistical significance tests (e.g., McNemar or Wilcoxon tests) comparing zero-shot LLM performance against the supervised tabular models. This directly undermines the load-bearing claim that tabular models 'generally perform better' across languages, as LOOCV on limited samples is known to yield unstable estimates and potential selection bias.

Authors: We agree that reporting per-fold statistics and significance tests will strengthen the presentation of our results. In the revised manuscript we will add the standard deviation of accuracy, F1, and AUC across LOOCV folds for every model-language pair. We will also report McNemar tests (or Wilcoxon signed-rank tests on fold-wise scores) between the best supervised tabular models and the zero-shot LLM baselines. We retain LOOCV because the per-language sample sizes are small, but we will explicitly discuss its known limitations and the steps taken to mitigate selection bias. revision: yes
Referee: [Results] Results section (model comparisons): The superiority of engineered linguistic features + embeddings over zero-shot LLM baselines is asserted, yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), confidence intervals, or effect sizes for the three languages in the main tables or text. Without these, the directional claim cannot be verified or sized.

Authors: We will revise the results section and associated tables to present all core metrics (accuracy, macro-F1, AUC) together with 95% confidence intervals and effect sizes for every language and model variant. This will make the magnitude and reliability of the observed differences explicit and directly address the referee's concern. revision: yes
Referee: [Methods] Feature engineering subsection: The linguistic features are described as 'engineered' and combined with embeddings, but no details are given on language-specific adaptations (e.g., POS taggers or parsers for Slovene/Korean), their validation against overfitting on the small datasets, or ablation showing they capture generalizable CI signals rather than dataset artifacts.

Authors: We will expand the feature-engineering subsection with language-specific implementation details, including the exact NLP pipelines and taggers used for Slovene and Korean. We will add an ablation study that isolates the contribution of each feature group and will describe the regularization and cross-validation-based feature-selection procedures employed to reduce overfitting risk. These additions will clarify that the features capture clinically relevant signals beyond dataset-specific artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical comparison study

full rationale

The paper conducts an empirical evaluation comparing zero-shot LLMs and supervised tabular models on CI classification tasks across three languages, reporting performance metrics obtained via leave-one-out cross-validation on held-out data. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided abstract or described methodology. Central claims rest on measured experimental outcomes rather than any reduction to inputs by construction. Self-citations, if present, do not bear load for any derivation chain since none exists beyond standard empirical reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning evaluation paper. No mathematical derivations, free parameters, or invented theoretical entities are introduced; all claims rest on experimental outcomes from speech data.

pith-pipeline@v0.9.0 · 5480 in / 1066 out tokens · 42154 ms · 2026-05-10T18:17:10.524281+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We compare zero-shot large language models (LLMs) ... with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1]

Introduction Cognitive impairment (CI) refers to a state in which a person’s cognitive functioning is below the ex- pected level and is a diagnosable condition (Ray and Davidson, 2014). CI can involve varying de- grees of deterioration of cognitive abilities such as memory, attention, executive functioning, and language, and it is often associated with ne...

work page 2014
[2]

prompt individuals to describe a complex vi- sual scene. The resulting descriptions enable qual- itative and quantitative assessment of language production, including lexical retrieval, syntactic for- mulation, fluency, informativeness, and narrative organisation. However, traditional diagnostic tools can be limited in their ability to detect early cognit...

work page internal anchor Pith review Pith/arXiv arXiv 2015
[3]

For example, Luz et al

Related Work Many studies investigating the detection and pre- dictionofcognitivedeclinehaveemployedclassical ML approaches (Huang et al., 2024; Kaser et al., 2024). For example, Luz et al. (2021) employed a range of classical ML models—including linear discriminant analysis, decision trees, k-nearest neighbours, random forests, and support vec- tor machi...

work page 2024
[4]

Datasets To evaluate cross-linguistic generalisability and per- formance stability, we ran parallel experiments on three languages: English, Slovene, and Korean

Data and preprocessing 3.1. Datasets To evaluate cross-linguistic generalisability and per- formance stability, we ran parallel experiments on three languages: English, Slovene, and Korean. The English and Slovene datasets were obtained from corpora of recordings of picture description tasks, while the Korean data came from a corpus of structured intervie...

work page 1994
[5]

Task and Data We study binary classification of CI (AD or MCI, depending on the dataset) versus HC from speech- derived inputs

Modelling Methodology 4.1. Task and Data We study binary classification of CI (AD or MCI, depending on the dataset) versus HC from speech- derived inputs. We evaluate performance sepa- rately for three languages: English, Slovene, and Korean (Section 3.1). All experiments are con- ducted within-language (i.e., training and evalua- tion never mix languages...

work page arXiv
[6]

Few-shot results (embeddings- only; k shots per class) are reported in Table 3

Results and Discussion Leave-one-out (LOO) Macro-F1 results are re- ported in Table 2. Few-shot results (embeddings- only; k shots per class) are reported in Table 3. We discuss the findings by research question. Table 3: Few-shot Macro-F1 (embeddings-only,k shots/class, 3 seeds) vs. LLM zero-shot reference. Best overall per language inbold. MethodkEnglis...

work page arXiv 2021
[7]

Conclusion We evaluated CI detection across three languages (English, Slovene, and Korean) comparing LLM zero-shot prompting, tabular foundation models, and classical ML. LLMs provide usable no-training baselines (best Macro-F1: 0.621), but supervised tabular models—especially those using expert- assisted symbolic features and fusion—achieve substantially...

work page
[8]

Code Availability The source code is publicly available athttps: //github.com/bkolosk1/foundational-c i-detection

work page
[9]

Limitations The Slovene dataset (n=27) exhibits potential con- founds: different recording formats and experi- menters for patient and control groups may inflate embedding-based performance. Two symbolic fea- tures (idea density and syntactic complexity) are unavailable for Korean due to parser limitations, and are therefore treated as missing and imputed...

work page 1994
[10]

Becker, Francois Boller, Oscar L

Bibliographical References James T. Becker, Francois Boller, Oscar L. Lopez, JulianaSaxton,andKennethL.McGonigle.1994. The natural history of alzheimer’s disease: De- scription of study cohort and accuracy of diagno- sis.Archives of Neurology, 51(6):585–594. ShaunaBerube,JodiNonnemacher,CorneliaDem- sky, Shenly Glenn, Sadhvi Saxena, Amy Wright, Donna C Ti...

work page arXiv 1994
[11]

Harold Goodglass and Edith Kaplan

Speech rate adjustment of adults during conversation.Journal of fluency disorders, 57:1– 10. Harold Goodglass and Edith Kaplan. 1983.The assessment of aphasia and related disorders. Hao Guan, John Novoa-Laurentiev, and Li Zhou

work page 1983
[12]

Harald Hampel, Sid E O’Bryant, José L Molinuevo, HenrikZetterberg,ColinLMasters,SimoneLista, StevenJKiddle,RichardBatrla,andKajBlennow

Cd-tron: Leveraging large clinical lan- guage model for early detection of cognitive de- cline from electronic health records.Journal of Biomedical Informatics, page 104830. Harald Hampel, Sid E O’Bryant, José L Molinuevo, HenrikZetterberg,ColinLMasters,SimoneLista, StevenJKiddle,RichardBatrla,andKajBlennow

work page
[13]

Noah Hollmann, Samuel Müller, Lennart Pu- rucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter

Blood-based biomarkers for alzheimer disease: mapping the road to the clinic.Nature Reviews Neurology, 14(11):639–652. Noah Hollmann, Samuel Müller, Lennart Pu- rucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. 2025. Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326...

work page arXiv 2025
[14]

Brian MacWhinney, Davida Fromm, Margaret Forbes,andAudreyHolland.2011

Alzheimer’s dementia recognition through spontaneous speech.Frontiers in Computer Sci- ence, 3:780169. Brian MacWhinney, Davida Fromm, Margaret Forbes,andAudreyHolland.2011. Aphasiabank: Methods for studying discourse.Aphasiology, 25:1286–1307. Chengsheng Mao, Jie Xu, Luke Rasmussen, Yikuan Li, Prakash Adekkanattu, Jennifer Pacheco, Borna Bonakdarpour, Ro...

work page arXiv 2011
[15]

Madhurananda Pahar, Fuxiang Tao, Bahman Mirheidari, Nathan Pevy, Rebecca Bright, Swap- nil Gadgil, Lise Sproson, Dorota Braun, Caitlin Illingworth, Daniel Blackburn, et al

Staging dementia using clinical demen- tia rating scale sum of boxes scores: a texas alzheimer’s research consortium study.Archives of Neurology, 65(8):1091–1095. Madhurananda Pahar, Fuxiang Tao, Bahman Mirheidari, Nathan Pevy, Rebecca Bright, Swap- nil Gadgil, Lise Sproson, Dorota Braun, Caitlin Illingworth, Daniel Blackburn, et al. 2025. Cog- nospeak: a...

work page 2025
[16]

Control” or “Patient

Relationship between the montreal cogni- tive assessment and mini-mental state examina- tion for assessment of mild cognitive impairment in older adults.BMC Geriatrics, 15(1):107. Alex H Williams, Erin Kunz, Simon Kornblith, and Scott W Linderman. 2021. Generalized shape metrics on neural representations. InAdvances in Neural Information Processing System...

work page 2021