
arxiv: 2504.14707 · v3 · submitted 2025-04-20 · 💻 cs.CL

FLAME: A New Dataset on FLemish Accounts of Momentary Experiences

Pith reviewed 2026-05-22 18:56 UTC · model grok-4.3

classification 💻 cs.CL
keywords FLAME · Flemish narratives · topic modeling · BERTopic · LDA · human evaluation · low-resource languages · cultural resonance

The pith

BERTopic produces more coherent and culturally resonant topics than LDA or K-Means on a new dataset of Flemish personal narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FLAME, a corpus of nearly 25,000 daily personal narratives written in Flemish. Researchers compare three topic modeling techniques to extract meaningful themes from this informal and culturally specific text. Automated metrics favor LDA, but human evaluators consistently rate BERTopic's output as more coherent and reflective of Flemish culture. This work illustrates why contextual language models matter for low-resource varieties and why human judgment should complement statistical measures in topic analysis.

Core claim

On the FLAME dataset of Flemish accounts of momentary experiences, LDA performs well according to automated coherence metrics, yet human evaluation shows that BERTopic generates the most coherent and culturally resonant topics. K-Means underperforms relative to results on other Dutch corpora, underscoring the challenges of this low-resource, narrative-rich data.

What carries the argument

Benchmarking of K-Means clustering, Latent Dirichlet Allocation (LDA), and BERTopic on the FLAME corpus, with evaluation through both automated coherence scores and human ratings for coherence and cultural resonance.
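The automated side of this benchmark can be illustrated with a small sketch. This summary does not state which coherence metric the paper uses, so the example assumes normalized pointwise mutual information (NPMI) over document co-occurrence, which is one common choice. The function name `npmi_coherence` and the toy Dutch tokens are illustrative and are not drawn from FLAME.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic.

    Probabilities are estimated from document-level co-occurrence:
    p(w) is the share of documents containing w.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI by convention
        elif p12 == 1.0:
            scores.append(1.0)   # words co-occur everywhere: avoid dividing by log(1) = 0
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical tokenized narratives, for illustration only.
toy_docs = [
    ["koffie", "ontbijt", "brood"],
    ["koffie", "ontbijt"],
    ["trein", "werk", "vertraging"],
    ["trein", "vertraging"],
    ["koffie", "trein"],
]

coherent_topic = npmi_coherence(["koffie", "ontbijt"], toy_docs)
mixed_topic = npmi_coherence(["ontbijt", "vertraging"], toy_docs)
print(coherent_topic, mixed_topic)
```

A score like this rewards word pairs that share documents, which is why a bag-of-words model such as LDA can score well on it even when human readers find the resulting topics less meaningful.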

If this is right

  • Contextual embeddings prove essential for effective topic modeling in culturally specific low-resource settings.
  • Human-centered evaluation reveals strengths of embedding-based methods that automated metrics miss.
  • The FLAME dataset enables new research into everyday themes in underrepresented language varieties.
  • Clustering-based approaches such as K-Means may struggle more with informal narrative data than with other Dutch texts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar human evaluation protocols could improve topic modeling for other minority language varieties in Europe.
  • The findings suggest that automated metrics should be supplemented with cultural context checks in narrative analysis tasks.
  • Extensions of the dataset might allow tracking changes in daily experiences over time or across Flemish regions.

Load-bearing premise

That the human raters' assessments of topic coherence and cultural resonance serve as reliable ground truth representative of the Flemish linguistic and cultural context.

What would settle it

A study repeating the human evaluation with a larger, demographically diverse sample of Flemish speakers and measuring inter-rater agreement to check if BERTopic topics remain preferred.

Figures

Figures reproduced from arXiv: 2504.14707 by Katie Hoemann, Niels Vanhasbroeck, Ratna Kandala.

Figure 1. BERTopic implementation pipeline. Accompanying text from Section 2.3.1 (Dimensionality Reduction): To address the challenges posed by the high dimensionality of the embeddings, we employ Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018) to reduce the high-dimensional embeddings to a low-dimensional space. UMAP is a non-linear technique adept at preserving both the local and global structure of the data from the hig…
Figure 2. Silhouette score vs. number of clusters.
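Figure 2 plots silhouette score against the number of clusters. A minimal sketch of how such a curve can be produced is shown below, assuming plain Lloyd-style K-Means on 2-D points with a deterministic farthest-point initialisation. The initialisation, the helper names `kmeans` and `silhouette`, and the toy points are illustrative assumptions, not the paper's setup, which clusters text embeddings.

```python
import math


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two well-separated toy blobs (hypothetical data).
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]

scores = {k: silhouette(toy_points, kmeans(toy_points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the score peaks at two clusters, matching the two blobs. On real narrative embeddings the curve is usually much flatter, so the peak is only a rough guide to the number of clusters.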
Original abstract

We introduce FLAME (FLemish Accounts of Momentary Experiences), a new corpus of nearly 25,000 daily personal narratives in Belgian-Dutch (Flemish), designed to support research on underrepresented language varieties in Natural Language Processing (NLP). Personal narratives of this kind hold rich potential for uncovering culturally grounded, everyday themes, yet extracting meaningful topics from such data is non-trivial, given the informal register, cultural specificity, and low-resource nature of the Flemish variety. We therefore ask: which topic modeling approach is best suited to reveal the latent themes in this corpus? To answer this, we benchmark three widely used methods: K-Means Clustering, Latent Dirichlet Allocation (LDA), and BERTopic, evaluating their ability to identify coherent and culturally relevant topics. While LDA achieves strong performance on automated coherence metrics, human evaluation reveals that BERTopic consistently produces the most coherent and culturally resonant topics, exposing the limitations of purely statistical methods on narrative-rich data. The diminished performance of K-Means compared to prior work on similar Dutch corpora further highlights the unique linguistic challenges posed by this dataset. Our findings demonstrate that contextual embeddings are critical for robust topic modeling in low-resource, culturally specific domains, and underscore the importance of human-centered evaluation alongside automated metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FLAME, a new corpus of nearly 25,000 daily personal narratives in Flemish (Belgian-Dutch), and benchmarks K-Means clustering, LDA, and BERTopic for topic modeling on this low-resource, culturally specific data. While LDA performs strongly on automated coherence metrics, the authors conclude from human evaluation that BERTopic yields the most coherent and culturally resonant topics, arguing that contextual embeddings are essential for such narrative-rich, informal registers.

Significance. If the human evaluation holds, the work would usefully demonstrate the limitations of purely statistical topic models on culturally grounded personal narratives and provide a new resource for Flemish NLP. The dataset itself is a clear contribution for underrepresented language varieties, and the explicit comparison of automated versus human judgments on coherence and cultural resonance is a strength.

major comments (2)
  1. Human evaluation section: the central claim that BERTopic produces the most coherent and culturally resonant topics rests on human judgments that override LDA's stronger automated coherence scores, yet no information is provided on the number of raters, their linguistic or cultural background, selection criteria, training, or any inter-rater reliability metric (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha). This detail is load-bearing for treating the human results as stable ground truth, especially for culturally specific claims.
  2. Evaluation methodology: the abstract and methods description report both automated and human results but omit sample sizes for the human evaluation, the exact rating scales used for coherence and cultural resonance, and any statistical tests comparing the models. Without these, the superiority claim for BERTopic cannot be fully assessed.
minor comments (2)
  1. The abstract states 'nearly 25,000' narratives; the exact count and any filtering steps should be stated precisely in the dataset section.
  2. Clarify the preprocessing pipeline for the Flemish text (e.g., handling of informal spelling, dialectal variants) before topic modeling, as this affects reproducibility.
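Major comment 1 asks for an inter-rater reliability metric. As a minimal sketch of what reporting one involves, the block below computes Fleiss' kappa from a table of rating counts. The function name `fleiss_kappa`, the three-level rating scale, and the toy counts are assumptions for illustration, not the paper's data or protocol.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Share of all assignments that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 4 topics, 3 raters, categories
# (incoherent, somewhat coherent, coherent).
toy_counts = [
    [0, 0, 3],
    [0, 1, 2],
    [3, 0, 0],
    [1, 1, 1],
]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```

Fleiss' kappa treats the categories as nominal. For ordinal rating scales, Krippendorff's alpha with an ordinal distance is often a better fit, which is one reason the report lists several candidate metrics.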

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing the FLAME dataset. The comments on the human evaluation methodology are well-taken and highlight important aspects of transparency. We address each major comment below and will revise the manuscript to incorporate additional details where the original submission was incomplete.

Point-by-point responses
  1. Referee: Human evaluation section: the central claim that BERTopic produces the most coherent and culturally resonant topics rests on human judgments that override LDA's stronger automated coherence scores, yet no information is provided on the number of raters, their linguistic or cultural background, selection criteria, training, or any inter-rater reliability metric (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha). This detail is load-bearing for treating the human results as stable ground truth, especially for culturally specific claims.

    Authors: We agree that these details are essential for establishing the reliability of the human judgments, particularly given the culturally specific nature of the claims. The current manuscript does not provide this information. In the revised version, we will expand the human evaluation section to report the number of raters, their linguistic and cultural backgrounds, selection criteria, any training provided, and inter-rater reliability metrics such as Fleiss' kappa. This will allow readers to better evaluate the stability of the results supporting BERTopic's advantages. revision: yes

  2. Referee: Evaluation methodology: the abstract and methods description report both automated and human results but omit sample sizes for the human evaluation, the exact rating scales used for coherence and cultural resonance, and any statistical tests comparing the models. Without these, the superiority claim for BERTopic cannot be fully assessed.

    Authors: We acknowledge the need for these methodological specifics to support the superiority claims. The manuscript currently omits sample sizes, exact rating scales, and statistical comparisons. We will revise the methods and evaluation sections to include the sample size for the human evaluation, the precise Likert or other scales used for coherence and cultural resonance ratings, and the results of appropriate statistical tests comparing model performance. These changes will make the evaluation fully assessable. revision: yes
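The second response promises statistical tests comparing the models without naming one. One option that makes few distributional assumptions is an exact paired sign-flip permutation test on per-rater scores, sketched below. The choice of test, the function name `paired_sign_flip_test`, and the ratings are all hypothetical and are not taken from the paper.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis that both models are rated alike, each paired
    difference is equally likely to be positive or negative, so all sign
    assignments are enumerated. Feasible only for small samples (2**n cases).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float inputs
            extreme += 1
    return extreme / total


# Hypothetical 1-5 coherence ratings from 8 raters who scored both models.
ratings_model_a = [4, 5, 4, 5, 4, 4, 5, 4]
ratings_model_b = [3, 3, 4, 2, 3, 3, 4, 3]

p_value = paired_sign_flip_test(ratings_model_a, ratings_model_b)
print(p_value)
```

Pairing by rater matters here: it removes differences in how strictly individual raters use the scale, so the test only asks whether the same person tends to rate one model's topics higher.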

Circularity Check

0 steps flagged

No significant circularity: empirical benchmarking against external human judgments

full rationale

The paper introduces the FLAME corpus and benchmarks K-Means, LDA, and BERTopic via automated coherence metrics plus human evaluation of coherence and cultural resonance. The central claim (BERTopic superiority on human criteria) rests on judgments external to the paper's own definitions or fitted parameters, with no equations, self-definitional loops, or renamings of known results. No load-bearing self-citations or ansatzes are invoked to force the outcome; the work is a self-contained empirical comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human topic judgments provide a valid external benchmark for cultural relevance; no free parameters or invented entities are introduced beyond standard topic-model hyperparameters.

axioms (1)
  • domain assumption: Human raters can reliably judge topic coherence and cultural resonance for Flemish narratives.
    Invoked when the paper states that human evaluation reveals BERTopic's superiority.


