Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech
Pith reviewed 2026-06-29 21:20 UTC · model grok-4.3
The pith
Bilingual fine-tuning lets all tested transformer models reach Macro-F1 of 0.969-0.973 for dementia detection in both English and Filipino speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In-domain performance does not transfer across languages, architectural modernization alone does not improve robustness, yet bilingual fine-tuning eliminates cross-lingual degradation across all transformer models and converges to Macro-F1 scores of 0.969-0.973; this indicates that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.
What carries the argument
The parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manually produced Filipino translations, evaluated under monolingual, zero-shot, and bilingual fine-tuning conditions.
If this is right
- Monolingual English training produces near-chance results on Filipino test data for every transformer tested.
- Switching to NeoBERT or other modern architectures does not reduce the cross-lingual performance gap.
- Bilingual fine-tuning equalizes performance across BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog at Macro-F1 0.969-0.973.
- Linguistic coverage during training outweighs model size or architecture for this clinical task.
Where Pith is reading between the lines
- Creating matched bilingual data may be more cost-effective for low-resource clinical NLP than pursuing larger monolingual models.
- The same pattern could appear in other code-switching clinical settings where discourse markers cross language boundaries.
- Future work could test whether the bilingual advantage persists when translations are generated automatically rather than manually.
Load-bearing premise
Filipino translations produced manually preserve discourse-level markers of cognitive decline.
What would settle it
A controlled study that shows the manual translations systematically alter or remove the discourse features models rely on for dementia classification would falsify the claim that bilingual results reflect genuine cross-language transfer.
Figures
read the original abstract
Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first evaluation of transformer models for dementia detection from Filipino speech using a newly constructed parallel bilingual dataset of 4,000 transcripts from DementiaBank, with manual Filipino translations. It compares models like BERT, NeoBERT, XLM-R under monolingual, zero-shot, and bilingual fine-tuning, reporting that bilingual training leads to Macro-F1 scores of 0.969-0.973 across models, while cross-lingual transfer fails (e.g., 0.455 for English BERT on Filipino). The authors argue that linguistic coverage during training, not model scale or architecture, drives performance.
Significance. Should the manual translations be shown to preserve discourse markers of cognitive decline, this would represent a significant contribution as the first work on Filipino dementia detection via NLP and provide empirical support for the importance of multilingual training data in clinical applications. The benchmarking of NeoBERT and the parallel dataset construction are notable strengths. However, without validation of the translations or detailed experimental controls, the central claim remains provisional.
major comments (1)
- [Abstract] Abstract: The central claim that bilingual fine-tuning eliminates cross-lingual degradation (Macro-F1 0.969-0.973) and that performance is driven by linguistic coverage relies on the 4,000 parallel transcripts retaining equivalent cognitive-decline signals on the Filipino side. The abstract states that translations were 'produced manually to preserve discourse-level markers of cognitive decline,' yet supplies no protocol, inter-translator agreement, or post-translation validation that features such as topic drift, repetition, or code-switch patterns survive. This assumption is load-bearing for the conclusion.
minor comments (2)
- [Abstract] Abstract: No dataset statistics (class balance, sample counts per class), training hyperparameters, or statistical significance tests for the reported Macro-F1 convergence are supplied, which limits evaluation of result robustness even if the translation concern is addressed.
- The abstract lists five model families but does not name the precise pre-trained checkpoints or tokenization details used for RoBERTa-Tagalog and NeoBERT.
Simulated Author's Rebuttal
We thank the referee for their detailed feedback on the translation validation issue. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that bilingual fine-tuning eliminates cross-lingual degradation (Macro-F1 0.969-0.973) and that performance is driven by linguistic coverage relies on the 4,000 parallel transcripts retaining equivalent cognitive-decline signals on the Filipino side. The abstract states that translations were 'produced manually to preserve discourse-level markers of cognitive decline,' yet supplies no protocol, inter-translator agreement, or post-translation validation that features such as topic drift, repetition, or code-switch patterns survive. This assumption is load-bearing for the conclusion.
Authors: We agree that the current manuscript lacks a detailed translation protocol, inter-translator agreement metrics, and explicit post-translation validation of discourse features. The translations were performed by bilingual native Filipino speakers with linguistic expertise, following instructions to preserve all original discourse elements including repetitions, hesitations, topic drift, and code-switching. However, no formal agreement study or feature-level validation was conducted. We will revise the manuscript by adding a dedicated Methods subsection describing the translation guidelines and process, and we will add an explicit Limitations paragraph acknowledging the absence of quantitative validation. These changes will be made without altering the reported experimental results or conclusions. revision: yes
Circularity Check
No circularity: pure empirical benchmarking with direct measurements
full rationale
The paper reports an empirical evaluation of transformer models on a constructed parallel bilingual dataset of 4000 transcripts. All reported results (Macro-F1 scores under monolingual, zero-shot, and bilingual settings) are direct performance measurements on held-out data. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the provided text. The central claim that bilingual fine-tuning eliminates cross-lingual degradation follows from the experimental comparisons rather than reducing to any input by construction. The manual translation step is a methodological choice whose validity is external to the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Manual Filipino translations preserve discourse-level markers of cognitive decline
Reference graph
Works this paper leans on
-
[1]
Yuxia Guo, Xu Dong, Mohammed Ali Al-Garadi, Abeed Sarker, Cecile Paris, and Diego Mollá
Linguistic features identify Alzheimer’s disease in narrative speech.Journal of Alzheimer’s Disease, 49(2):407–422. Yuxia Guo, Xu Dong, Mohammed Ali Al-Garadi, Abeed Sarker, Cecile Paris, and Diego Mollá. 2020. Benchmarking of transformer-based pre-trained models on social media text classification datasets. InProceedings of the 18th Annual Workshop of th...
2020
-
[2]
InProceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2592–2601, Marseille, France
TweetTaglish: A dataset for investigating Tagalog-English code-switching. InProceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2592–2601, Marseille, France. European Language Resources Association. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In7th International Conference on Learning Repres...
2019
-
[3]
Revisiting Small Batch Training for Deep Neural Networks
Language deficits across PET-based Braak stages of tau accumulation in Alzheimer’s disease. Alzheimer’s & Dementia, 22(3):e71286. Dominic Masters and Carlo Luschi. 2018. Revisiting small batch training for deep neural networks.arXiv preprint arXiv:1804.07612. Deepthi Mave, Suraj Maharjan, and Thamar Solorio
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[4]
InProceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia
Language identification and analysis of code-switched social media text. InProceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia. Association for Computational Linguistics. Lester James V . Miranda. 2023. Developing a named entity recognition dataset for Tagalog. In Proceedings of the...
2023
-
[5]
InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 31239–31273, Vienna, Austria
Batayan: A Filipino NLP benchmark for evaluating large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 31239–31273, Vienna, Austria. Association for Computational Linguistics. Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, and Najim Dehak. 2020. Using s...
2020
-
[6]
Association for Computational Linguistics
How multilingual is multilingual BERT? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. InProceedings of the 2019 Conference on Empiric...
2019
-
[7]
GLU Variants Improve Transformer
Fluent translations from disfluent speech in end-to-end speech translation. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2786–2792. D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Mich...
work page internal anchor Pith review Pith/arXiv arXiv 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.