FLAME: A New Dataset on FLemish Accounts of Momentary Experiences

Katie Hoemann; Niels Vanhasbroeck; Ratna Kandala

arxiv: 2504.14707 · v3 · submitted 2025-04-20 · 💻 cs.CL

Ratna Kandala, Niels Vanhasbroeck, Katie Hoemann

Pith reviewed 2026-05-22 18:56 UTC · model grok-4.3

keywords FLAME · Flemish narratives · topic modeling · BERTopic · LDA · human evaluation · low-resource languages · cultural resonance

The pith

BERTopic produces more coherent and culturally resonant topics than LDA or K-Means on a new dataset of Flemish personal narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FLAME, a corpus of nearly 25,000 daily personal narratives written in Flemish. Researchers compare three topic modeling techniques to extract meaningful themes from this informal and culturally specific text. Automated metrics favor LDA, but human evaluators consistently rate BERTopic's output as more coherent and reflective of Flemish culture. This work illustrates why contextual language models matter for low-resource varieties and why human judgment should complement statistical measures in topic analysis.

Core claim

On the FLAME dataset of Flemish accounts of momentary experiences, LDA performs well according to automated coherence metrics, yet human evaluation shows that BERTopic generates the most coherent and culturally resonant topics. K-Means underperforms relative to results on other Dutch corpora, underscoring the challenges of this low-resource, narrative-rich data.

What carries the argument

Benchmarking of K-Means clustering, Latent Dirichlet Allocation (LDA), and BERTopic on the FLAME corpus, with evaluation through both automated coherence scores and human ratings for coherence and cultural resonance.
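
To make "automated coherence scores" concrete, here is a minimal sketch of one common coherence measure, mean NPMI over a topic's top words. This is an assumption for illustration: the page does not say which coherence metric the paper uses, and the function name `npmi_coherence` and the toy tokens are hypothetical, not taken from FLAME.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words.

    Probabilities are estimated from document-level co-occurrence.
    Illustrative only; the paper's actual metric is not stated on this page.
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: NPMI lower bound
        elif p12 == 1.0:
            scores.append(1.0)   # words co-occur in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Made-up tokenized narratives, used only to exercise the function.
docs = [
    ["koffie", "werk", "collega"],
    ["koffie", "werk"],
    ["fiets", "regen"],
    ["fiets", "regen", "werk"],
]
coherent = npmi_coherence(["koffie", "werk"], docs)
unrelated = npmi_coherence(["koffie", "fiets"], docs)
print(coherent, unrelated)
```

A score like this rewards words that share documents, which is one reason a bag-of-words model can score well on it while human raters still prefer topics from an embedding-based model.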

If this is right

Contextual embeddings prove essential for effective topic modeling in culturally specific low-resource settings.
Human-centered evaluation reveals strengths of embedding-based methods that automated metrics miss.
The FLAME dataset enables new research into everyday themes in underrepresented language varieties.
K-Means clustering may struggle more with informal narrative data than it does with other Dutch corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar human evaluation protocols could improve topic modeling for other minority language varieties in Europe.
The findings suggest that automated metrics should be supplemented with cultural context checks in narrative analysis tasks.
Extensions of the dataset might allow tracking changes in daily experiences over time or across Flemish regions.

Load-bearing premise

That the human raters' assessments of topic coherence and cultural resonance serve as reliable ground truth representative of the Flemish linguistic and cultural context.

What would settle it

A study repeating the human evaluation with a larger, demographically diverse sample of Flemish speakers and measuring inter-rater agreement to check if BERTopic topics remain preferred.

Figures

Figures reproduced from arXiv: 2504.14707 by Katie Hoemann, Niels Vanhasbroeck, Ratna Kandala.

**Figure 1.** BERTopic implementation pipeline.

Excerpt from Section 2.3.1 (Dimensionality Reduction): To address the challenges posed by the high dimensionality of the embeddings, we employ Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018) to reduce the high dimensional embeddings to a low dimensional space. UMAP is a non-linear technique adept at preserving both the local and global structure of the data from the hig…

**Figure 2.** Silhouette Score vs Number of Clusters.
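
For readers unfamiliar with the quantity on the vertical axis of Figure 2, the following is a small pure-Python sketch of the mean silhouette coefficient. It assumes Euclidean distance and made-up 2-D points; how the paper computed or used the score (for example, to pick the number of K-Means clusters) is not described on this page, so treat that reading as an assumption.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient for labelled points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = mean_dist(i, own)  # cohesion: distance to own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Made-up points: two tight, well-separated pairs.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(pts, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_score(pts, [0, 1, 0, 1])   # labels cut across it
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters; values near or below 0 indicate overlapping or misassigned clusters.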

Original abstract

We introduce FLAME (FLemish Accounts of Momentary Experiences), a new corpus of nearly 25,000 daily personal narratives in Belgian-Dutch (Flemish), designed to support research on underrepresented language varieties in Natural Language Processing (NLP). Personal narratives of this kind hold rich potential for uncovering culturally grounded, everyday themes, yet extracting meaningful topics from such data is non-trivial, given the informal register, cultural specificity, and low-resource nature of the Flemish variety. We therefore ask: which topic modeling approach is best suited to reveal the latent themes in this corpus? To answer this, we benchmark three widely used methods: K-Means Clustering, Latent Dirichlet Allocation (LDA), and BERTopic, evaluating their ability to identify coherent and culturally relevant topics. While LDA achieves strong performance on automated coherence metrics, human evaluation reveals that BERTopic consistently produces the most coherent and culturally resonant topics, exposing the limitations of purely statistical methods on narrative-rich data. The diminished performance of K-Means compared to prior work on similar Dutch corpora further highlights the unique linguistic challenges posed by this dataset. Our findings demonstrate that contextual embeddings are critical for robust topic modeling in low-resource, culturally specific domains, and underscore the importance of human-centered evaluation alongside automated metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New Flemish narrative corpus is the real addition here, but the human topic evaluation lacks the details needed to back the main claim over automated metrics.

The paper introduces FLAME, a corpus of nearly 25,000 Flemish personal narratives, and benchmarks three topic modeling approaches on it. That dataset fills a gap for an underrepresented language variety, and the comparison between K-Means, LDA, and BERTopic is a straightforward way to test what works on informal, culturally specific text. LDA looks strong on the usual coherence numbers, yet the authors say human raters find BERTopic more coherent and resonant. The K-Means results also lag behind earlier Dutch work, which flags real linguistic differences in this data.

Referee Report

2 major / 2 minor

Summary. The paper introduces FLAME, a new corpus of nearly 25,000 daily personal narratives in Flemish (Belgian-Dutch), and benchmarks K-Means clustering, LDA, and BERTopic for topic modeling on this low-resource, culturally specific data. While LDA performs strongly on automated coherence metrics, the authors conclude from human evaluation that BERTopic yields the most coherent and culturally resonant topics, arguing that contextual embeddings are essential for such narrative-rich, informal registers.

Significance. If the human evaluation holds, the work would usefully demonstrate the limitations of purely statistical topic models on culturally grounded personal narratives and provide a new resource for Flemish NLP. The dataset itself is a clear contribution for underrepresented language varieties, and the explicit comparison of automated versus human judgments on coherence and cultural resonance is a strength.

major comments (2)

Human evaluation section: the central claim that BERTopic produces the most coherent and culturally resonant topics rests on human judgments that override LDA's stronger automated coherence scores, yet no information is provided on the number of raters, their linguistic or cultural background, selection criteria, training, or any inter-rater reliability metric (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha). This detail is load-bearing for treating the human results as stable ground truth, especially for culturally specific claims.
Evaluation methodology: the abstract and methods description report both automated and human results but omit sample sizes for the human evaluation, the exact rating scales used for coherence and cultural resonance, and any statistical tests comparing the models. Without these, the superiority claim for BERTopic cannot be fully assessed.
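
As an illustration of the inter-rater reliability reporting requested in the first major comment, here is a minimal sketch of Fleiss' kappa. The count matrices are invented for the example; nothing here reflects the paper's actual raters or ratings.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must have the same total number of ratings.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all assignments that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    if abs(1.0 - p_e) < 1e-12:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Invented data: 4 topics, 3 raters, 2 categories (coherent / not coherent).
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
kappa_unanimous = fleiss_kappa(unanimous)
kappa_split = fleiss_kappa(split)
print(kappa_unanimous, kappa_split)
```

Reporting a statistic of this kind alongside the rater count and backgrounds would let readers judge how stable the human preference for BERTopic is.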

minor comments (2)

The abstract states 'nearly 25,000' narratives; the exact count and any filtering steps should be stated precisely in the dataset section.
Clarify the preprocessing pipeline for the Flemish text (e.g., handling of informal spelling, dialectal variants) before topic modeling, as this affects reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing the FLAME dataset. The comments on the human evaluation methodology are well-taken and highlight important aspects of transparency. We address each major comment below and will revise the manuscript to incorporate additional details where the original submission was incomplete.

Referee: Human evaluation section: the central claim that BERTopic produces the most coherent and culturally resonant topics rests on human judgments that override LDA's stronger automated coherence scores, yet no information is provided on the number of raters, their linguistic or cultural background, selection criteria, training, or any inter-rater reliability metric (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha). This detail is load-bearing for treating the human results as stable ground truth, especially for culturally specific claims.

Authors: We agree that these details are essential for establishing the reliability of the human judgments, particularly given the culturally specific nature of the claims. The current manuscript does not provide this information. In the revised version, we will expand the human evaluation section to report the number of raters, their linguistic and cultural backgrounds, selection criteria, any training provided, and inter-rater reliability metrics such as Fleiss' kappa. This will allow readers to better evaluate the stability of the results supporting BERTopic's advantages. revision: yes
Referee: Evaluation methodology: the abstract and methods description report both automated and human results but omit sample sizes for the human evaluation, the exact rating scales used for coherence and cultural resonance, and any statistical tests comparing the models. Without these, the superiority claim for BERTopic cannot be fully assessed.

Authors: We acknowledge the need for these methodological specifics to support the superiority claims. The manuscript currently omits sample sizes, exact rating scales, and statistical comparisons. We will revise the methods and evaluation sections to include the sample size for the human evaluation, the precise Likert or other scales used for coherence and cultural resonance ratings, and the results of appropriate statistical tests comparing model performance. These changes will make the evaluation fully assessable. revision: yes
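
One simple form the promised statistical comparison could take is an exact paired sign-flip permutation test. This is a sketch under stated assumptions, not the authors' method: it assumes each rater scores both models on the same scale, and the ratings below are invented.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Enumerates all 2**n sign assignments, so it is only practical
    for a small number of pairs.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Invented ratings from six hypothetical raters (1 to 5 scale assumed).
model_a = [5, 4, 5, 4, 5, 4]
model_b = [3, 3, 4, 2, 3, 3]
p_value = paired_permutation_test(model_a, model_b)
print(p_value)
```

A permutation test makes no normality assumption, which suits ordinal ratings, but other tests (for example, a Wilcoxon signed-rank test) would be equally reasonable choices.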

Circularity Check

0 steps flagged

No significant circularity: empirical benchmarking against external human judgments

The paper introduces the FLAME corpus and benchmarks K-Means, LDA, and BERTopic via automated coherence metrics plus human evaluation of coherence and cultural resonance. The central claim (BERTopic superiority on human criteria) rests on judgments external to the paper's own definitions or fitted parameters, with no equations, self-definitional loops, or renamings of known results. No load-bearing self-citations or ansatzes are invoked to force the outcome; the work is a self-contained empirical comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human topic judgments provide a valid external benchmark for cultural relevance; no free parameters or invented entities are introduced beyond standard topic-model hyperparameters.

axioms (1)

domain assumption: Human raters can reliably judge topic coherence and cultural resonance for Flemish narratives.
Invoked when the paper states that human evaluation reveals BERTopic's superiority.

pith-pipeline@v0.9.0 · 5756 in / 1194 out tokens · 24453 ms · 2026-05-22T18:56:50.007965+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

Computational Linguistics 34(4), 555–596 (2008)

A survey of text classification algorithms. In Mining Text Data, pages 163–222. Springer, Boston, MA. Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596. https://doi.org/10.1162/coli.07-034-R2. Bastian Tamm, Jordi Poncelet, Manon Barberis, and Marjolein Vandermosten

work page doi:10.1162/coli.07-034-r2 2008
[2]

In Dutch Speech Tech Day 2024, Hilversum, The Netherlands

Weakly supervised training improves Flemish ASR of non-standard speech. In Dutch Speech Tech Day 2024, Hilversum, The Netherlands. Berkhin, Pavel

work page 2024
[3]

In Jacob Kogan, Charles Nicholas, and Marc Teboulle (eds.), Grouping Multidimensional Data, pages 25–71

A survey of clustering data mining techniques. In Jacob Kogan, Charles Nicholas, and Marc Teboulle (eds.), Grouping Multidimensional Data, pages 25–71. Springer. Bhatia, S., Lau, J. H., & Baldwin, T. (2018). Topic intrusion for automatic topic model evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. ...

work page doi:10.18653/v1/d18-1238 2018
[4]

Journal of Machine Learning Research, 3:993–1022

Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, 22 (NeurIPS). Campello, Ricardo J. G. B., Davoud Moulavi, and Jörg Sander

work page 2009
[5]

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794. Kamiloğlu, Roza G., et al

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Emotion, 25(1):271–276

What makes us feel good? A data-driven investigation of positive emotion experience. Emotion, 25(1):271–276. Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Sage Publications. Lau, Jey Han, David Newman, and Timothy Baldwin

work page 2004
[7]

Least squares quantization in

Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 530–539, Gothenburg, Sweden. Association for Computational Linguistics. Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Tra...

work page doi:10.1109/tit.1982.1056489 1982
[8]

D., Raghavan, P., & Schütze, H

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. Mimno, David, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum

work page 2008
[9]

robbert-2022-dutch-sentence-transformers (revision cdf42f6). https://huggingface.co/NFI/robbert-2022-dutch-sentence-transformers Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (201...

work page 2022
[10]

Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C

Lawrence Erlbaum Associates, Mahwah, NJ. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 101–108). Association for Computational Linguist...

work page doi:10.18653/v1/2020.acl-demos.14 2020
[11]

Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992. Association for Computational Linguistics. Röder, Michael, Andreas Both, and Alexander Hinneburg

work page 2019
[12]

Journal of Computational and Applied Mathematics, 20:53–65

Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65. Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523. https://doi.org/10.1016/0306-4573(88)90021-0. Schäfer, Karla, Jeon...

work page doi:10.1016/0306-4573(88)90021-0 1988
[13]

arXiv preprint arXiv:2409.10173

jina-embeddings-v3: Multilingual embeddings with task LoRA. arXiv preprint arXiv:2409.10173. Zhang, Xin, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, et al

work page arXiv
[14]

What is going on now or since the last prompt, and how do you feel about it?

mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412. Appendix: A. Participants Participants were a community sample recruited through flyers, online posts, and word-of-mouth. To be...

work page 2024
