A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

Philip A. Fisher; Sihong Liu; Yan Jiang

arxiv: 2605.23093 · v1 · pith:SVLR2QKInew · submitted 2026-05-21 · 💻 cs.CL · cs.CY

Pith reviewed 2026-05-25 05:18 UTC · model grok-4.3

classification 💻 cs.CL cs.CY

keywords topic modeling · BERTopic · structural topic models · short text · survey responses · topic coherence · contextual augmentation · applied psychology

The pith

BERTopic with contextual augmentation outperforms Structural Topic Models on coherence and interpretability for short survey responses, while STM supports stronger covariate inference

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares Structural Topic Models, a probabilistic approach, to BERTopic, an embedding-based method, specifically for short open-ended survey responses common in psychology. BERTopic conditions, particularly those using contextual augmentation to add semantic context to brief texts, achieved higher topic coherence and produced more interpretable and stable topics than STM. STM topics tended to be broader and less distinct, but the model allows for stronger statistical analysis of how topics relate to covariates such as participant characteristics. The evaluation across multiple preprocessing and embedding variations leads to the conclusion that the two methods have complementary strengths rather than one dominating the other for applied social science tasks.

Core claim

BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths.

What carries the argument

Head-to-head comparison of three STM conditions and five BERTopic conditions varying typographical correction, stemming, embedding choice, and contextual augmentation for short responses

If this is right

BERTopic is preferable when the goal is coherent and interpretable topics from short texts.
STM is preferable when statistical inference on topic-covariate relationships is needed.
Contextual augmentation yields the largest performance gains for BERTopic on very short responses.
Using higher-dimensional embeddings alone does not improve coherence and leads to more data loss.
The methods can be used together to combine exploratory topic quality with inferential capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Researchers in other short-text domains could apply similar comparative evaluations to guide method choice.
A model that merges BERTopic coherence with STM inference features could resolve the identified trade-off.
New metrics might be developed to assess both topic quality and statistical utility simultaneously.
Extending the evaluation to additional datasets would test the robustness of the complementary strengths finding.

Load-bearing premise

That topic coherence, interpretability, and stability metrics, together with the tested variations, sufficiently capture the performance differences most relevant to applied psychology with short responses.

What would settle it

Replicating the study on a new short survey response dataset and finding that STM matches or exceeds BERTopic in coherence and stability, or that BERTopic matches STM in supporting inferential covariate analysis.

Original abstract

Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.
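
One way to picture contextual augmentation is sketched below in Python. The abstract does not spell out the procedure, so everything here is an assumption rather than the authors' method: the sketch prepends the survey question to any response shorter than a token threshold before the text is embedded. The function name `augment_response`, the `min_tokens` threshold, and the use of the question as the added context are all hypothetical, and the example question is only illustrative.

```python
# Hypothetical sketch of contextual augmentation for very short responses.
# Assumption: the added "context" is the survey question itself; the paper's
# actual procedure may differ.

QUESTION = (
    "What are the biggest challenges and concerns "
    "for you and your family right now?"
)


def augment_response(response: str, question: str, min_tokens: int = 5) -> str:
    """Prepend the question to responses with fewer than min_tokens words."""
    text = response.strip()
    if not text:
        # Empty responses are returned unchanged so they can be
        # filtered out in a separate missing-data step.
        return text
    if len(text.split()) >= min_tokens:
        # Long enough to carry its own context; leave it alone.
        return text
    return f"{question.strip()} {text}"


if __name__ == "__main__":
    for raw in ["money", "we cannot afford rent this month", "   "]:
        print(repr(augment_response(raw, QUESTION)))
```

Whether context should be added only to the shortest responses or to all of them is a design choice this sketch does not settle.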

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript compares Structural Topic Models (STM) and BERTopic for short, open-ended survey responses in applied psychology. It evaluates three STM conditions and five BERTopic conditions (varying preprocessing, embeddings, and a novel contextual augmentation strategy), reporting that BERTopic with contextual augmentation yields higher topic coherence and more interpretable/stable topics while STM supports stronger inferential covariate analysis; the methods are positioned as complementary with practical guidance offered for social science applications.

Significance. If the results hold under more rigorous validation, the work provides a useful empirical comparison for researchers handling short survey texts and introduces contextual augmentation as a practical technique. The explicit discussion of complementary strengths (coherence/interpretability vs. inferential support) is a constructive contribution. However, the significance is limited by the absence of external validation against psychological constructs or downstream utility, which directly affects applicability to the target use case.

major comments (2)

[Abstract and §3 (Methods/Evaluation)] The central claim, that BERTopic with contextual augmentation outperforms STM on coherence and interpretability (and that the tested metrics suffice for applied psychology), rests on topic coherence and qualitative criteria alone. No external validation, downstream task performance, expert ratings tied to psychological relevance, or statistical tests for differences are reported, which leaves open whether the metrics align with construct validity or predictive utility for survey outcomes.
[Abstract] The reported comparative outcomes come with no information on dataset characteristics (number of responses, length distribution, domain), the exact coherence metrics used, statistical testing, sample sizes, or controls for confounds. This information is load-bearing for assessing the robustness of the performance claims.

minor comments (1)

The description of the five BERTopic conditions and three STM conditions could be presented in a single summary table for easier comparison of the experimental design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below, proposing revisions where they strengthen the work without altering its core scope.

Point-by-point responses

Referee: [Abstract and §3 (Methods/Evaluation)] The central claim, that BERTopic with contextual augmentation outperforms STM on coherence and interpretability (and that the tested metrics suffice for applied psychology), rests on topic coherence and qualitative criteria alone. No external validation, downstream task performance, expert ratings tied to psychological relevance, or statistical tests for differences are reported, which leaves open whether the metrics align with construct validity or predictive utility for survey outcomes.

Authors: We agree that external validation against psychological constructs or downstream utility would provide stronger evidence of applicability. Our evaluation follows standard topic modeling practice by reporting coherence (NPMI) and qualitative assessments of interpretability and stability, which are the dominant metrics in the literature for model comparison. We will add pairwise statistical tests (e.g., t-tests or Wilcoxon tests with correction) for coherence differences across conditions. We will also expand the discussion and limitations sections to explicitly address the gap between coherence/qualitative metrics and construct validity or predictive utility in applied psychology, noting that such validation would require additional expert annotations or outcome-linked data beyond the current study. These changes clarify the scope of our claims while preserving the paper's focus on comparative evaluation.
revision: partial

Referee: [Abstract] The reported comparative outcomes come with no information on dataset characteristics (number of responses, length distribution, domain), the exact coherence metrics used, statistical testing, sample sizes, or controls for confounds. This information is load-bearing for assessing the robustness of the performance claims.

Authors: We will revise the abstract to include the key dataset details (number of responses, mean and range of response lengths, domain as applied psychology open-ended surveys), the exact coherence metric (normalized pointwise mutual information), mention of statistical testing for differences, and a brief note on preprocessing controls. These additions will make the performance claims more transparent and allow readers to better assess robustness without lengthening the abstract excessively.
revision: yes
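
For readers unfamiliar with the metric named in the rebuttal, the following Python sketch shows one common document-level formulation of NPMI for a single topic. It is a minimal illustration under stated assumptions, not the authors' evaluation code: whitespace tokenization, co-occurrence counted at the whole-response level, a score of -1 for word pairs that never co-occur, and the helper name `npmi_topic` are all choices made here.

```python
import math
from itertools import combinations


def npmi_topic(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words (needs >= 2 words).

    For a word pair (a, b), with probabilities estimated as the share of
    documents containing the word(s):
        NPMI(a, b) = log(P(a, b) / (P(a) * P(b))) / -log(P(a, b))
    """
    # Assumption: naive lowercase whitespace tokenization, one set per document.
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for a, b in combinations(top_words, 2):
        p_a, p_b, p_ab = prob(a), prob(b), prob(a, b)
        if p_ab == 0:
            scores.append(-1.0)  # convention: never co-occur -> minimum score
        elif p_ab == 1:
            scores.append(1.0)   # limiting value; avoids dividing by log(1) = 0
        else:
            pmi = math.log(p_ab / (p_a * p_b))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy responses, invented for illustration only.
    docs = [
        "rent bills money",
        "rent money job",
        "childcare school closed",
        "childcare school money",
    ]
    print(npmi_topic(["rent", "money"], docs))   # words that tend to co-occur
    print(npmi_topic(["rent", "school"], docs))  # words that never co-occur
```

With responses this short, many word pairs never co-occur at all, so the handling of zero counts can noticeably shift the reported coherence; that is one reason the exact metric definition the referee asks for matters.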

Circularity Check

0 steps flagged

Empirical methods comparison with no derivations or self-referential elements

Full rationale

This paper is a direct empirical comparison of STM and BERTopic variants on short survey responses, reporting topic coherence scores, qualitative interpretability judgments, and covariate analysis differences. No equations, first-principles derivations, parameter fits presented as predictions, or load-bearing self-citations appear in the abstract or described content. All reported outcomes are computed from the input data and standard metrics without reduction to the paper's own inputs by construction, so the evaluation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparison study with no new mathematical models, free parameters, axioms, or postulated entities.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 1 internal anchor

[1]

a dog eats a donut

STM AND BERTOPIC FOR SURVEY RESPONSES 1 A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses Yan Jiang, Sihong Liu, Philip A. Fisher Stanford Center on Early Childhood, Stanford University Author Note Yan Jiang https://orcid.org/0000-0002-8825-8641 Sihong Liu https://orcid.org/0000-0002-5188-5334 Philip A...

work page 2014
[2]

What are the biggest challenges and concerns for you and your family right now?

First, each document in the corpus is transformed into a contextualized embedding vector using a pretrained language model. Because distances in very high-dimensional spaces can become less informative, BERTopic next applies dimensionality reduction, most commonly using UMAP (McInnes et al., 2018), to project the embeddings into a lower-dimensional space ...

work page 2018
[3]

This resulted in 18 model runs per condition and a total of 144 model runs across all conditions

(Angelov, 2020; Grootendorst, 2022). This resulted in 18 model runs per condition and a total of 144 model runs across all conditions. BERTopic models were implemented in Python using Google Colab (Grootendorst, 2022), whereas STM models were estimated using the stm package in R (Roberts et al., 2019). For STM AND BERTOPIC FOR SURVEY RESPONSES 15 STM, pre...

work page 2020
[4]

financial,

to evaluate topic coherence, which has been shown to correlate with human judgements (Lau et al., 2014). NPMI measures the extent to which pairs of words within a topic co-occur in the corpus relative to chance, based on the assumption that coherent topics consist of words that frequently appear together. NPMI ranges from −1 to 1, with higher values indic...

work page 2014
[5]

Spectral

Figure 7 Topic Prevalence by Time and Provider Type STM AND BERTOPIC FOR SURVEY RESPONSES 22 Note. Topic prevalence is measured as proportions rather than raw frequencies; therefore, temporal patterns should be interpreted within each model’s own set of identified topics. Discussion This study highlights that model selection and specification play importa...

work page doi:10.48550/arxiv.2008.09470 2023
[6]

B., Ruiz, F

https://doi.org/10.1038/s44271-025-00376-6 Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439–453. Egger, R., & Yu, J. (2022). A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Frontiers in Sociology, 7, ...

work page doi:10.1038/s44271-025-00376-6 2020
[7]

Feuerriegel, S., Maarouf, A., Bär, D., Geissler, D., Schweisthal, J., Pröllochs, N., Robertson, C

ACM Transactions on Intelligent Systems and Technology, 15(5), 1–25. Feuerriegel, S., Maarouf, A., Bär, D., Geissler, D., Schweisthal, J., Pröllochs, N., Robertson, C. E., Rathje, S., Hartmann, J., Mohammad, S. M., & Netzer, O. (2025). Using natural language processing to analyse text data in behavioural science. Nature Reviews Psychology, 4(2), 96–111. G...

work page 2025
[8]

STM AND BERTOPIC FOR SURVEY RESPONSES 29 McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv. https://arxiv.org/abs/1301.3781 New...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1037/adb0001138 2018
