Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Chelsea Dominique E. Temprosa; Edric Castel C. Hao; Georgianna Z. Reyes; Hannah Grachiella Bu\~nales; Kervin Gabriel L. Chua; Rez Samantha Z. Floresca

arxiv: 2605.26007 · v1 · pith:JOD3PKGYnew · submitted 2026-05-25 · 💻 cs.CL

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Rez Samantha Z. Floresca , Edric Castel C. Hao , Hannah Grachiella Bu\~nales , Chelsea Dominique E. Temprosa , Georgianna Z. Reyes , Kervin Gabriel L. Chua This is my paper

Pith reviewed 2026-06-29 21:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords dementia detectionFilipino speechbilingual fine-tuningcross-lingual transferclinical NLPtransformer modelsNeoBERT

0 comments

The pith

Bilingual fine-tuning lets all tested transformer models reach Macro-F1 of 0.969-0.973 for dementia detection in both English and Filipino speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a parallel bilingual dataset of 4,000 DementiaBank transcripts and their manual Filipino translations to test dementia detection across languages. It evaluates TF-IDF, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog under monolingual, zero-shot cross-lingual, and bilingual fine-tuning regimes. English-only training yields poor Filipino results (Macro-F1 drops to 0.455 for BERT), and newer architectures do not close the gap. Bilingual fine-tuning removes the degradation for every model family and produces nearly identical high scores. The results point to linguistic coverage in training data as the dominant factor over model scale or design.

Core claim

In-domain performance does not transfer across languages, architectural modernization alone does not improve robustness, yet bilingual fine-tuning eliminates cross-lingual degradation across all transformer models and converges to Macro-F1 scores of 0.969-0.973; this indicates that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.

What carries the argument

The parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manually produced Filipino translations, evaluated under monolingual, zero-shot, and bilingual fine-tuning conditions.

If this is right

Monolingual English training produces near-chance results on Filipino test data for every transformer tested.
Switching to NeoBERT or other modern architectures does not reduce the cross-lingual performance gap.
Bilingual fine-tuning equalizes performance across BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog at Macro-F1 0.969-0.973.
Linguistic coverage during training outweighs model size or architecture for this clinical task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Creating matched bilingual data may be more cost-effective for low-resource clinical NLP than pursuing larger monolingual models.
The same pattern could appear in other code-switching clinical settings where discourse markers cross language boundaries.
Future work could test whether the bilingual advantage persists when translations are generated automatically rather than manually.

Load-bearing premise

Filipino translations produced manually preserve discourse-level markers of cognitive decline.

What would settle it

A controlled study that shows the manual translations systematically alter or remove the discourse features models rely on for dementia classification would falsify the claim that bilingual results reflect genuine cross-language transfer.

Figures

Figures reproduced from arXiv: 2605.26007 by Chelsea Dominique E. Temprosa, Edric Castel C. Hao, Georgianna Z. Reyes, Hannah Grachiella Bu\~nales, Kervin Gabriel L. Chua, Rez Samantha Z. Floresca.

**Figure 2.** Figure 2: TF-IDF + Logistic Regression baseline training pipeline for evaluating lexical separability under cross-lingual and clinical domain shift conditions We use TF-IDF with Logistic Regression as a lexical baseline to assess whether dementia-related [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Transformer Models Fine-tuning Pipeline massively multilingual, and language-matched, enabling a systematic decomposition of cross-lingual robustness along both architectural and pretraining dimensions. 3.3.1 Model Architecture and Fine-Tuning All models follow a standard transformer-based classification pipeline. Given an input transcript x, tokens are first mapped to subword embeddings using a model-spec… view at source ↗

read the original abstract

Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Bilingual fine-tuning lifts all models to the same high score, but the result hinges on unvalidated manual translations preserving the dementia signals.

read the letter

The main point is that monolingual English models tank on Filipino (down to 0.455 F1) while bilingual training brings every transformer up to 0.969-0.973, suggesting coverage during training matters more than model choice. That pattern is the paper's clearest empirical contribution.

What is new is the first reported transformer evaluation for dementia detection in Filipino speech and the first use of NeoBERT in any clinical NLP task. They also built a 4,000-transcript parallel set from DementiaBank with manual Filipino translations. The experimental grid—monolingual, zero-shot cross-lingual, and bilingual—is straightforward and lets them separate language effects from domain effects.

The setup is reasonable on its face. Running TF-IDF, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog across the three regimes produces a consistent story that architectural upgrades alone do not fix the transfer failure.

The soft spot is exactly where the stress-test note points: the translations. The abstract states they were produced manually to preserve discourse markers, yet supplies no inter-translator agreement, no validation protocol, and no check that repetition, topic drift, or code-switching patterns survived. If those signals were altered or lost, the bilingual convergence could be an artifact of the parallel data rather than evidence for the coverage hypothesis. The abstract also omits dataset statistics, class balance, training details, and any error analysis or significance tests, so the reported numbers cannot be assessed for robustness yet.

This is for researchers working on clinical NLP in code-switched or low-resource languages. Someone building screening tools for the Philippines or testing multilingual transfer would get a useful data point from the bilingual result, provided the translation step holds up.

It deserves peer review. The gap is real and the question is well-posed; the methods section will decide whether the numbers are reliable.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the first evaluation of transformer models for dementia detection from Filipino speech using a newly constructed parallel bilingual dataset of 4,000 transcripts from DementiaBank, with manual Filipino translations. It compares models like BERT, NeoBERT, XLM-R under monolingual, zero-shot, and bilingual fine-tuning, reporting that bilingual training leads to Macro-F1 scores of 0.969-0.973 across models, while cross-lingual transfer fails (e.g., 0.455 for English BERT on Filipino). The authors argue that linguistic coverage during training, not model scale or architecture, drives performance.

Significance. Should the manual translations be shown to preserve discourse markers of cognitive decline, this would represent a significant contribution as the first work on Filipino dementia detection via NLP and provide empirical support for the importance of multilingual training data in clinical applications. The benchmarking of NeoBERT and the parallel dataset construction are notable strengths. However, without validation of the translations or detailed experimental controls, the central claim remains provisional.

major comments (1)

[Abstract] Abstract: The central claim that bilingual fine-tuning eliminates cross-lingual degradation (Macro-F1 0.969-0.973) and that performance is driven by linguistic coverage relies on the 4,000 parallel transcripts retaining equivalent cognitive-decline signals on the Filipino side. The abstract states that translations were 'produced manually to preserve discourse-level markers of cognitive decline,' yet supplies no protocol, inter-translator agreement, or post-translation validation that features such as topic drift, repetition, or code-switch patterns survive. This assumption is load-bearing for the conclusion.

minor comments (2)

[Abstract] Abstract: No dataset statistics (class balance, sample counts per class), training hyperparameters, or statistical significance tests for the reported Macro-F1 convergence are supplied, which limits evaluation of result robustness even if the translation concern is addressed.
The abstract lists five model families but does not name the precise pre-trained checkpoints or tokenization details used for RoBERTa-Tagalog and NeoBERT.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their detailed feedback on the translation validation issue. We address the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that bilingual fine-tuning eliminates cross-lingual degradation (Macro-F1 0.969-0.973) and that performance is driven by linguistic coverage relies on the 4,000 parallel transcripts retaining equivalent cognitive-decline signals on the Filipino side. The abstract states that translations were 'produced manually to preserve discourse-level markers of cognitive decline,' yet supplies no protocol, inter-translator agreement, or post-translation validation that features such as topic drift, repetition, or code-switch patterns survive. This assumption is load-bearing for the conclusion.

Authors: We agree that the current manuscript lacks a detailed translation protocol, inter-translator agreement metrics, and explicit post-translation validation of discourse features. The translations were performed by bilingual native Filipino speakers with linguistic expertise, following instructions to preserve all original discourse elements including repetitions, hesitations, topic drift, and code-switching. However, no formal agreement study or feature-level validation was conducted. We will revise the manuscript by adding a dedicated Methods subsection describing the translation guidelines and process, and we will add an explicit Limitations paragraph acknowledging the absence of quantitative validation. These changes will be made without altering the reported experimental results or conclusions. revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking with direct measurements

full rationale

The paper reports an empirical evaluation of transformer models on a constructed parallel bilingual dataset of 4000 transcripts. All reported results (Macro-F1 scores under monolingual, zero-shot, and bilingual settings) are direct performance measurements on held-out data. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the provided text. The central claim that bilingual fine-tuning eliminates cross-lingual degradation follows from the experimental comparisons rather than reducing to any input by construction. The manual translation step is a methodological choice whose validity is external to the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the manually translated parallel corpus faithfully retains cognitive-decline signals; otherwise the bilingual result cannot be attributed to linguistic coverage.

axioms (1)

domain assumption Manual Filipino translations preserve discourse-level markers of cognitive decline
Explicitly invoked in the abstract as the justification for creating the parallel dataset.

pith-pipeline@v0.9.1-grok · 5797 in / 1167 out tokens · 26397 ms · 2026-06-29T21:20:55.943829+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 2 canonical work pages · 2 internal anchors

[1]

Yuxia Guo, Xu Dong, Mohammed Ali Al-Garadi, Abeed Sarker, Cecile Paris, and Diego Mollá

Linguistic features identify Alzheimer’s disease in narrative speech.Journal of Alzheimer’s Disease, 49(2):407–422. Yuxia Guo, Xu Dong, Mohammed Ali Al-Garadi, Abeed Sarker, Cecile Paris, and Diego Mollá. 2020. Benchmarking of transformer-based pre-trained models on social media text classification datasets. InProceedings of the 18th Annual Workshop of th...

2020
[2]

InProceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2592–2601, Marseille, France

TweetTaglish: A dataset for investigating Tagalog-English code-switching. InProceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2592–2601, Marseille, France. European Language Resources Association. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In7th International Conference on Learning Repres...

2019
[3]

Revisiting Small Batch Training for Deep Neural Networks

Language deficits across PET-based Braak stages of tau accumulation in Alzheimer’s disease. Alzheimer’s & Dementia, 22(3):e71286. Dominic Masters and Carlo Luschi. 2018. Revisiting small batch training for deep neural networks.arXiv preprint arXiv:1804.07612. Deepthi Mave, Suraj Maharjan, and Thamar Solorio

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

InProceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia

Language identification and analysis of code-switched social media text. InProceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia. Association for Computational Linguistics. Lester James V . Miranda. 2023. Developing a named entity recognition dataset for Tagalog. In Proceedings of the...

2023
[5]

InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 31239–31273, Vienna, Austria

Batayan: A Filipino NLP benchmark for evaluating large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 31239–31273, Vienna, Austria. Association for Computational Linguistics. Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, and Najim Dehak. 2020. Using s...

2020
[6]

Association for Computational Linguistics

How multilingual is multilingual BERT? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. InProceedings of the 2019 Conference on Empiric...

2019
[7]

GLU Variants Improve Transformer

Fluent translations from disfluent speech in end-to-end speech translation. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2786–2792. D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Mich...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[1] [1]

Yuxia Guo, Xu Dong, Mohammed Ali Al-Garadi, Abeed Sarker, Cecile Paris, and Diego Mollá

Linguistic features identify Alzheimer’s disease in narrative speech.Journal of Alzheimer’s Disease, 49(2):407–422. Yuxia Guo, Xu Dong, Mohammed Ali Al-Garadi, Abeed Sarker, Cecile Paris, and Diego Mollá. 2020. Benchmarking of transformer-based pre-trained models on social media text classification datasets. InProceedings of the 18th Annual Workshop of th...

2020

[2] [2]

InProceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2592–2601, Marseille, France

TweetTaglish: A dataset for investigating Tagalog-English code-switching. InProceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2592–2601, Marseille, France. European Language Resources Association. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In7th International Conference on Learning Repres...

2019

[3] [3]

Revisiting Small Batch Training for Deep Neural Networks

Language deficits across PET-based Braak stages of tau accumulation in Alzheimer’s disease. Alzheimer’s & Dementia, 22(3):e71286. Dominic Masters and Carlo Luschi. 2018. Revisiting small batch training for deep neural networks.arXiv preprint arXiv:1804.07612. Deepthi Mave, Suraj Maharjan, and Thamar Solorio

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [4]

InProceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia

Language identification and analysis of code-switched social media text. InProceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia. Association for Computational Linguistics. Lester James V . Miranda. 2023. Developing a named entity recognition dataset for Tagalog. In Proceedings of the...

2023

[5] [5]

InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 31239–31273, Vienna, Austria

Batayan: A Filipino NLP benchmark for evaluating large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 31239–31273, Vienna, Austria. Association for Computational Linguistics. Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, and Najim Dehak. 2020. Using s...

2020

[6] [6]

Association for Computational Linguistics

How multilingual is multilingual BERT? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. InProceedings of the 2019 Conference on Empiric...

2019

[7] [7]

GLU Variants Improve Transformer

Fluent translations from disfluent speech in end-to-end speech translation. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2786–2792. D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Mich...

work page internal anchor Pith review Pith/arXiv arXiv 2019