FLAME: A New Dataset on FLemish Accounts of Momentary Experiences
Pith reviewed 2026-05-22 18:56 UTC · model grok-4.3
The pith
BERTopic produces more coherent and culturally resonant topics than LDA or K-Means on a new dataset of Flemish personal narratives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the FLAME dataset of Flemish accounts of momentary experiences, LDA performs well according to automated coherence metrics, yet human evaluation shows that BERTopic generates the most coherent and culturally resonant topics. K-Means underperforms relative to results on other Dutch corpora, underscoring the challenges of this low-resource, narrative-rich data.
What carries the argument
Benchmarking of K-Means clustering, Latent Dirichlet Allocation (LDA), and BERTopic on the FLAME corpus, with evaluation through both automated coherence scores and human ratings for coherence and cultural resonance.
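To make "automated coherence score" concrete, here is a minimal sketch of NPMI (normalized pointwise mutual information) coherence computed from document-level co-occurrence. This page does not say which coherence metric the paper used, so NPMI is an assumption; the function name `npmi_coherence` and the toy tokens are hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words.

    documents: list of token lists. Co-occurrence is counted per document.
    Assumption: a pair that never co-occurs scores -1, and a pair present
    in every document scores 0 (NPMI is undefined there).
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)
        elif p12 == 1.0:
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus (hypothetical tokens, not taken from FLAME).
docs = [
    ["koffie", "ontbijt", "brood"],
    ["koffie", "ontbijt"],
    ["fiets", "regen"],
    ["fiets", "regen", "werk"],
]
same_theme = npmi_coherence(["koffie", "ontbijt"], docs)
mixed_theme = npmi_coherence(["koffie", "ontbijt", "fiets"], docs)
```

A score of this kind only rewards words that appear together in the same documents. It has no notion of whether a topic reads as culturally meaningful, which is one plausible reason automated and human rankings can diverge.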
If this is right
- Contextual embeddings prove essential for effective topic modeling in culturally specific low-resource settings.
- Human-centered evaluation reveals strengths of embedding-based methods that automated metrics miss.
- The FLAME dataset enables new research into everyday themes in underrepresented language varieties.
- Purely statistical topic models like K-Means may struggle more with informal narrative data than with other Dutch texts.
Where Pith is reading between the lines
- Similar human evaluation protocols could improve topic modeling for other minority language varieties in Europe.
- The findings suggest that automated metrics should be supplemented with cultural context checks in narrative analysis tasks.
- Extensions of the dataset might allow tracking changes in daily experiences over time or across Flemish regions.
Load-bearing premise
That the human raters' assessments of topic coherence and cultural resonance serve as reliable ground truth representative of the Flemish linguistic and cultural context.
What would settle it
A study repeating the human evaluation with a larger, demographically diverse sample of Flemish speakers and measuring inter-rater agreement to check if BERTopic topics remain preferred.
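One way such a replication could quantify inter-rater agreement is Fleiss' kappa, which handles more than two raters assigning items to nominal categories. The sketch below is a plain-Python illustration; the function name and the example counts are hypothetical, and the paper's actual rating design is not described on this page.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who put item i (e.g. a topic) into category j (e.g. a rating level).

    Every item must be rated by the same number of raters (at least two).
    Returns NaN when expected agreement is 1, where kappa is undefined.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical example: 4 topics, 3 raters, 2 categories (coherent / not).
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
```

Fleiss' kappa treats categories as unordered. For ordinal rating scales, Krippendorff's alpha with an ordinal distance is a common alternative.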
Original abstract
We introduce FLAME (FLemish Accounts of Momentary Experiences), a new corpus of nearly 25,000 daily personal narratives in Belgian-Dutch (Flemish), designed to support research on underrepresented language varieties in Natural Language Processing (NLP). Personal narratives of this kind hold rich potential for uncovering culturally grounded, everyday themes, yet extracting meaningful topics from such data is non-trivial, given the informal register, cultural specificity, and low-resource nature of the Flemish variety. We therefore ask: which topic modeling approach is best suited to reveal the latent themes in this corpus? To answer this, we benchmark three widely used methods: K-Means Clustering, Latent Dirichlet Allocation (LDA), and BERTopic, evaluating their ability to identify coherent and culturally relevant topics. While LDA achieves strong performance on automated coherence metrics, human evaluation reveals that BERTopic consistently produces the most coherent and culturally resonant topics, exposing the limitations of purely statistical methods on narrative-rich data. The diminished performance of K-Means compared to prior work on similar Dutch corpora further highlights the unique linguistic challenges posed by this dataset. Our findings demonstrate that contextual embeddings are critical for robust topic modeling in low-resource, culturally specific domains, and underscore the importance of human-centered evaluation alongside automated metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FLAME, a new corpus of nearly 25,000 daily personal narratives in Flemish (Belgian-Dutch), and benchmarks K-Means clustering, LDA, and BERTopic for topic modeling on this low-resource, culturally specific data. While LDA performs strongly on automated coherence metrics, the authors conclude from human evaluation that BERTopic yields the most coherent and culturally resonant topics, arguing that contextual embeddings are essential for such narrative-rich, informal registers.
Significance. If the human evaluation holds, the work would usefully demonstrate the limitations of purely statistical topic models on culturally grounded personal narratives and provide a new resource for Flemish NLP. The dataset itself is a clear contribution for underrepresented language varieties, and the explicit comparison of automated versus human judgments on coherence and cultural resonance is a strength.
major comments (2)
- Human evaluation section: the central claim that BERTopic produces the most coherent and culturally resonant topics rests on human judgments that override LDA's stronger automated coherence scores, yet no information is provided on the number of raters, their linguistic or cultural background, selection criteria, training, or any inter-rater reliability metric (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha). This detail is load-bearing for treating the human results as stable ground truth, especially for culturally specific claims.
- Evaluation methodology: the abstract and methods description report both automated and human results but omit sample sizes for the human evaluation, the exact rating scales used for coherence and cultural resonance, and any statistical tests comparing the models. Without these, the superiority claim for BERTopic cannot be fully assessed.
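As a sketch of the kind of statistical test the referee asks for, the following paired sign-flip permutation test compares two models rated on the same items. It is only one possible choice, not something the paper reports; the function name, parameters, and example ratings are hypothetical.

```python
import itertools
import random


def paired_sign_flip_test(scores_a, scores_b, n_resamples=10000, seed=0,
                          max_exact=2 ** 16):
    """Two-sided p-value for the mean paired difference between two models.

    scores_a[i] and scores_b[i] are ratings of the same item (or by the same
    rater) for models A and B. Under the null hypothesis the sign of each
    difference is arbitrary, so signs are flipped and the statistic recomputed.
    Small samples are enumerated exactly; larger ones are sampled.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    tol = 1e-12  # guards against floating-point ties
    n = len(diffs)

    if 2 ** n <= max_exact:
        extreme = sum(
            1
            for signs in itertools.product((1, -1), repeat=n)
            if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - tol
        )
        return extreme / 2 ** n

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed - tol:
            extreme += 1
    # Add-one correction keeps the Monte Carlo estimate above zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical ratings: model A scores one point higher on all four topics.
p_value = paired_sign_flip_test([5, 5, 5, 5], [4, 4, 4, 4])
```

The example also shows why sample size matters: with only four paired ratings, even a perfectly consistent advantage cannot produce a p-value below 2/16.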
minor comments (2)
- The abstract states 'nearly 25,000' narratives; the exact count and any filtering steps should be stated precisely in the dataset section.
- Clarify the preprocessing pipeline for the Flemish text (e.g., handling of informal spelling, dialectal variants) before topic modeling, as this affects reproducibility.
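To illustrate the kind of decision a preprocessing description would need to document, here is a minimal normalization sketch for informal text. It is an assumption-laden example, not the authors' pipeline: the function name, the rule of collapsing character runs to two, and the tokenization pattern are all hypothetical choices.

```python
import re
import unicodedata

# Three or more repeats of the same character ("heeeeel").
_ELONGATION = re.compile(r"(.)\1{2,}")
# Runs of letters, optionally joined by an apostrophe or hyphen ("m'n").
_TOKEN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def normalize_narrative(text):
    """Lowercase, unify Unicode forms, squeeze elongations, and tokenize.

    Elongated runs are collapsed to two characters so that legitimate
    doubles (as in "heel") survive; this cannot restore a word whose
    standard spelling has a single letter there.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    text = _ELONGATION.sub(r"\1\1", text)
    return _TOKEN.findall(text)


tokens = normalize_narrative("Heeeeel goed!")
clitic = normalize_narrative("m'n fiets")
```

Each rule changes which surface forms count as the same word, so it directly affects the vocabulary seen by K-Means, LDA, and BERTopic.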
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript introducing the FLAME dataset. The comments on the human evaluation methodology are well-taken and highlight important aspects of transparency. We address each major comment below and will revise the manuscript to incorporate additional details where the original submission was incomplete.
- Referee: Human evaluation section: the central claim that BERTopic produces the most coherent and culturally resonant topics rests on human judgments that override LDA's stronger automated coherence scores, yet no information is provided on the number of raters, their linguistic or cultural background, selection criteria, training, or any inter-rater reliability metric (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha). This detail is load-bearing for treating the human results as stable ground truth, especially for culturally specific claims.
  Authors: We agree that these details are essential for establishing the reliability of the human judgments, particularly given the culturally specific nature of the claims. The current manuscript does not provide this information. In the revised version, we will expand the human evaluation section to report the number of raters, their linguistic and cultural backgrounds, selection criteria, any training provided, and inter-rater reliability metrics such as Fleiss' kappa. This will allow readers to better evaluate the stability of the results supporting BERTopic's advantages.
  Revision: yes
- Referee: Evaluation methodology: the abstract and methods description report both automated and human results but omit sample sizes for the human evaluation, the exact rating scales used for coherence and cultural resonance, and any statistical tests comparing the models. Without these, the superiority claim for BERTopic cannot be fully assessed.
  Authors: We acknowledge the need for these methodological specifics to support the superiority claims. The manuscript currently omits sample sizes, exact rating scales, and statistical comparisons. We will revise the methods and evaluation sections to include the sample size for the human evaluation, the precise Likert or other scales used for coherence and cultural resonance ratings, and the results of appropriate statistical tests comparing model performance. These changes will make the evaluation fully assessable.
  Revision: yes
Circularity Check
No significant circularity: empirical benchmarking against external human judgments
The paper introduces the FLAME corpus and benchmarks K-Means, LDA, and BERTopic via automated coherence metrics plus human evaluation of coherence and cultural resonance. The central claim (BERTopic superiority on human criteria) rests on judgments external to the paper's own definitions or fitted parameters, with no equations, self-definitional loops, or renamings of known results. No load-bearing self-citations or ansatzes are invoked to force the outcome; the work is self-contained empirical comparison.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human raters can reliably judge topic coherence and cultural resonance for Flemish narratives.
Appendix: A. Participants
Participants were a community sample recruited through flyers, online posts, and word-of-mouth. To be...