Mining Twitter to Assess the Determinants of Health Behavior towards Human Papillomavirus Vaccination in the United States

Adam G. Dunn; Christopher Wheldon; Cui Tao; Hansi Zhang; Jiang Bian; Jinhai Huo; Mattia Prosperi; Rui Zhang; Yi Guo

arXiv: 1907.11624 · v1 · pith:QEHGOFH5new · submitted 2019-07-06 · 💻 cs.SI · cs.CY · cs.LG

Pith reviewed 2026-05-25 01:32 UTC · model grok-4.3

classification 💻 cs.SI · cs.CY · cs.LG

keywords Twitter mining · HPV vaccination · health behavior · topic modeling · Integrated Behavior Model · social media analysis · HINTS survey · geocoded tweets

The pith

Twitter mining can assess HPV vaccination health behaviors comparably to surveys and yield additional insights through a theory-driven approach.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the feasibility of mining Twitter to evaluate determinants of consumers' health behavior towards HPV vaccination, guided by the Integrated Behavior Model. Researchers collected millions of tweets spanning 2014-2018, preprocessed and geocoded them, applied a rule-based classifier to separate promotional content from consumer discussions, and used topic modeling to extract 122 themes. These themes were then aligned with responses from the Health Information National Trends Survey, revealing correlations in topic prevalence and geographic distributions. The work shows that social media analysis can match survey findings while supplying extra detail through structured theoretical mapping.

Core claim

Not only mining Twitter to assess consumers' health behaviors can obtain results comparable to surveys but can yield additional insights via a theory-driven approach.

What carries the argument

Rule-based classifier that separates promotional information from consumers' discussions, followed by topic modeling to discover themes mapped against Integrated Behavior Model constructs and HINTS survey questions.
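
As a rough illustration of what a rule-based separation of promotional and consumer tweets might look like, here is a minimal sketch. The paper's actual rules are not stated in the text above, so every pattern below (links, calls to action, first-person pronouns) and the function name `classify_tweet` are assumptions made for illustration only.

```python
import re

# Hypothetical cue lists. These are NOT the paper's rules, which are not
# given in the text above; they only show the general shape of the approach.
PROMO_PATTERNS = [
    r"https?://",            # links are common in campaign and news posts
    r"\blearn more\b",
    r"\bget vaccinated\b",
    r"\bask your doctor\b",
]
CONSUMER_PATTERNS = [
    r"\b(i|my|me|we|our)\b",  # first-person language suggests personal talk
    r"\bgot (my|the) (shot|vaccine)\b",
]


def classify_tweet(text: str) -> str:
    """Label a tweet 'promotional' or 'consumer' by counting matched cues.

    Ties, including tweets that match no cue at all, fall back to
    'consumer'. That default is an arbitrary choice for this sketch.
    """
    lowered = text.lower()
    promo_hits = sum(bool(re.search(p, lowered)) for p in PROMO_PATTERNS)
    consumer_hits = sum(bool(re.search(p, lowered)) for p in CONSUMER_PATTERNS)
    return "promotional" if promo_hits > consumer_hits else "consumer"


print(classify_tweet("Learn more about the HPV vaccine: https://example.org"))
print(classify_tweet("I got my HPV shot today and my arm hurts"))
```

Rules of this kind are easy to inspect, but the tie-breaking default and the choice of cues directly shape the partition, which is why every downstream count depends on how the classifier is validated.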

If this is right

87 of the 122 topics show correlations between promotional tweets and consumer discussions.
35 topics map directly to specific HPV-related questions in the HINTS survey by keyword.
112 topics align with constructs from the Integrated Behavior Model.
45 topics exhibit statistically significant correlations with HINTS responses when compared by geographic distribution.
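
To make the geographic comparison concrete, the sketch below computes a rank correlation between a topic's share of consumer tweets per region and the share of survey respondents giving a particular answer in the same regions. The text above does not state the geographic unit or the correlation statistic, so the use of states, the Spearman coefficient, and all numbers are assumptions chosen for illustration, not data from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up illustrative values keyed by state; not taken from the paper.
topic_share = {"FL": 0.12, "TX": 0.08, "NY": 0.15, "CA": 0.10, "OH": 0.05}
hints_share = {"FL": 0.40, "TX": 0.31, "NY": 0.47, "CA": 0.35, "OH": 0.30}

states = sorted(topic_share)
rho = spearman([topic_share[s] for s in states],
               [hints_share[s] for s in states])
print(f"Spearman rho across {len(states)} states: {rho:.2f}")
```

With one such coefficient per topic, a count like "45 significant topics" comes from testing each coefficient separately, so the number of tests run matters when interpreting it.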

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Health agencies might use similar Twitter pipelines to track shifts in vaccination sentiment between national survey waves.
Theory-guided topic models could surface emerging public concerns that fixed questionnaire items miss.
The same classification-plus-mapping pipeline could be tested on other preventive behaviors such as flu shots or colorectal screening.

Load-bearing premise

That the rule-based classifier accurately separates promotional information from consumers' discussions, and that the resulting topics validly represent determinants of health behavior as defined by the Integrated Behavior Model.

What would settle it

If manual review reveals high rates of misclassification by the rule-based model or if the 45 topics fail to show statistically significant geographic correlations with actual HINTS responses, the central claim would not hold.

Original abstract

Objectives: To test the feasibility of using Twitter data to assess determinants of consumers' health behavior towards Human papillomavirus (HPV) vaccination informed by the Integrated Behavior Model (IBM).

Methods: We used three Twitter datasets spanning from 2014 to 2018. We preprocessed and geocoded the tweets, and then built a rule-based model that classified each tweet into either promotional information or consumers' discussions. We applied topic modeling to discover major themes, and subsequently explored the associations between the topics learned from consumers' discussions and the responses of HPV-related questions in the Health Information National Trends Survey (HINTS).

Results: We collected 2,846,495 tweets and analyzed 335,681 geocoded tweets. Through topic modeling, we identified 122 high-quality topics. The most discussed consumer topic is "cervical cancer screening"; while in promotional tweets, the most popular topic is to increase awareness of "HPV causes cancer". 87 out of the 122 topics are correlated between promotional information and consumers' discussions. Guided by IBM, we examined the alignment between our Twitter findings and the results obtained from HINTS. 35 topics can be mapped to HINTS questions by keywords, 112 topics can be mapped to IBM constructs, and 45 topics have statistically significant correlations with HINTS responses in terms of geographic distributions.

Conclusion: Not only mining Twitter to assess consumers' health behaviors can obtain results comparable to surveys but can yield additional insights via a theory-driven approach. Limitations exist, nevertheless, these encouraging results impel us to develop innovative ways of leveraging social media in the changing health communication landscape.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper gets 45 geographic correlations between Twitter topics and HINTS responses on HPV, but the unvalidated rule-based classifier is a real problem for the comparability claim.

The punchline is that they extracted 122 topics from 335k geocoded tweets, split them into promotional and consumer categories with a rule-based model, mapped 112 to IBM constructs, and found 45 topics with statistically significant geographic links to HINTS answers. That specific set of numbers is new for this domain. They also show 87 topics correlated across the two tweet types and 35 keyword-mapped to HINTS questions.

The work is straightforward: large-scale collection, standard LDA, theory-guided mapping, and spatial correlation checks. It does a clean job of scaling up the data and trying to anchor it in the Integrated Behavior Model rather than just doing unsupervised mining. Credit for shipping concrete counts instead of vague claims.

The soft spot is the classifier. The abstract and stress-test note give no precision, recall, or ground-truth check on how well the rules separate promotional from consumer tweets. If that partition is noisy, the downstream topic correlations and the claim that Twitter yields results comparable to surveys do not follow cleanly. The post-hoc keyword mappings to HINTS and IBM are also theory-driven rather than data-derived, which is common but limits how strongly one can say the topics represent the same determinants. Minor issues include no mention of error bars or multiple-testing correction in the abstract.

This is for public-health informatics groups already working with social media data as a survey supplement. A reader interested in HPV communication or IBM applications could pull useful numbers from it. It has enough empirical content and a clear methods pipeline to deserve a serious referee rather than a desk reject, even though the validation gap will need addressing.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that mining Twitter data can assess determinants of HPV vaccination health behaviors in a manner comparable to traditional surveys like HINTS, while providing additional insights through a theory-driven approach using the Integrated Behavior Model (IBM). It describes collecting over 2.8 million tweets, geocoding 335k, applying a rule-based classifier to distinguish promotional from consumer tweets, topic modeling to identify 122 topics, mapping them to IBM constructs and HINTS questions, and finding correlations including 45 with geographic significance.

Significance. If the classifier validation and mapping robustness hold, the work could demonstrate a scalable, theory-augmented method for real-time public health surveillance that complements surveys with social media volume and geographic granularity.

major comments (3)

[Methods] Methods section: the rule-based classifier separating promotional information from consumers' discussions is presented without any reported validation (precision, recall, accuracy, ground-truth annotation, or inter-rater statistics). This partition is load-bearing for the central claim, because the 122 topics, 35 HINTS mappings, 112 IBM mappings, 87 promotional-consumer correlations, and 45 geographic links all inherit any misclassification error.
[Results] Results section: the 35 keyword-based mappings of topics to HINTS questions and 112 mappings to IBM constructs are performed post-hoc and theory-guided; no quantitative validation, sensitivity analysis to keyword choice, or inter-rater reliability for the mappings is supplied, so the asserted comparability to survey results does not follow from the reported data.
[Results] Results section: the 45 topics reported to have statistically significant geographic correlations with HINTS responses, and the 87 topics correlated between promotional and consumer tweets, are given without error bars, confidence intervals, or correction for multiple testing, weakening the strength of the geographic and cross-type alignment claims.
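
The validation statistics named in the first major comment can be computed from a small hand-annotated sample. The sketch below is one way to do that under stated assumptions: the label strings, the toy annotations, and the function names are hypothetical, and treating "promotional" as the positive class is an arbitrary choice.

```python
from collections import Counter


def precision_recall_accuracy(gold, predicted, positive="promotional"):
    """Score classifier output against gold labels for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators beyond chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


P, C = "promotional", "consumer"
# Toy annotations for six tweets; invented for illustration only.
annotator_1 = [P, P, C, C, C, P]
annotator_2 = [P, C, C, C, C, P]
classifier = [P, P, P, C, C, C]

print("kappa:", round(cohens_kappa(annotator_1, annotator_2), 3))
print("precision, recall, accuracy:",
      precision_recall_accuracy(annotator_1, classifier))
```

Kappa measures how far the two human annotators agree before their labels are used as ground truth; precision, recall, and accuracy then measure the classifier against that ground truth.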

minor comments (2)

[Abstract] Abstract: the final sentence contains an awkward construction ('Not only mining Twitter to assess consumers' health behaviors can obtain results comparable to surveys'); rephrase for grammatical clarity.
[Methods] Methods: the exact keyword rules for the classifier and the procedure for selecting the number of topics (free parameter) should be stated explicitly to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Methods] Methods section: the rule-based classifier separating promotional information from consumers' discussions is presented without any reported validation (precision, recall, accuracy, ground-truth annotation, or inter-rater statistics). This partition is load-bearing for the central claim, because the 122 topics, 35 HINTS mappings, 112 IBM mappings, 87 promotional-consumer correlations, and 45 geographic links all inherit any misclassification error.

Authors: We agree that the absence of formal validation metrics for the rule-based classifier is a limitation. The classifier relies on explicit keyword rules distinguishing promotional language from consumer discussions, but no precision/recall or inter-rater statistics were reported. In the revised manuscript we will add a dedicated validation subsection: a random sample of 500 tweets will be independently annotated by two researchers, with precision, recall, accuracy, and Cohen's kappa reported. This directly strengthens the foundation for all downstream results. revision: yes
Referee: [Results] Results section: the 35 keyword-based mappings of topics to HINTS questions and 112 mappings to IBM constructs are performed post-hoc and theory-guided; no quantitative validation, sensitivity analysis to keyword choice, or inter-rater reliability for the mappings is supplied, so the asserted comparability to survey results does not follow from the reported data.

Authors: The mappings were constructed by direct keyword overlap between discovered topics and the wording of HINTS items or IBM constructs, which provides transparency. Nevertheless, we acknowledge the lack of sensitivity analysis or inter-rater checks. In revision we will (i) vary keyword inclusion thresholds and report how the set of 35/112 mappings changes, and (ii) have two independent coders assess a 20% subsample of mappings for agreement. These additions will quantify robustness. revision: yes
Referee: [Results] Results section: the 45 topics reported to have statistically significant geographic correlations with HINTS responses, and the 87 topics correlated between promotional and consumer tweets, are given without error bars, confidence intervals, or correction for multiple testing, weakening the strength of the geographic and cross-type alignment claims.

Authors: We concur that reporting confidence intervals and applying multiple-testing correction is necessary for rigorous interpretation. In the revised manuscript we will recompute all correlations (Pearson/Spearman as appropriate) with 95% bootstrap confidence intervals and apply the Benjamini-Hochberg procedure. Updated counts of significant associations (after correction) and the corresponding intervals will be presented in revised tables and text. revision: yes
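
The two statistical additions promised in the last response, bootstrap confidence intervals and the Benjamini-Hochberg procedure, can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' analysis: the percentile bootstrap, the resample count, the seed, and all input numbers are choices made for the example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return None
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (sd_x * sd_y)


def bootstrap_ci(x, y, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the Pearson correlation of (x, y)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs together
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples with no variance
            stats.append(r)
    stats.sort()
    lower = stats[int((1 - level) / 2 * len(stats))]
    upper = stats[int((1 + level) / 2 * len(stats)) - 1]
    return lower, upper


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a reject/keep flag per p-value, controlling FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank  # keep the largest rank passing its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= largest_rank
    return rejected


# Made-up regional percentages and p-values, for illustration only.
topic_pct = [12, 8, 15, 10, 5, 9, 14, 7]
hints_pct = [40, 31, 47, 35, 30, 33, 45, 36]
print("95% CI:", bootstrap_ci(topic_pct, hints_pct))
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```

In the toy p-value list, five values fall below 0.05 but only the two smallest survive correction, which shows why a count of significant topics can shrink once multiple testing is accounted for.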

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external theory and independent statistical checks

The paper preprocesses tweets, applies a rule-based classifier to separate promotional vs. consumer content, runs topic modeling to extract 122 topics, performs keyword-based mapping of topics to IBM constructs and HINTS questions, and computes geographic correlations between topics and HINTS responses. None of these steps reduce by construction to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. IBM is an external model cited from prior literature; mappings are explicit and post-hoc rather than tautological; correlations are computed against an independent survey dataset. The central comparability claim therefore rests on observable statistical alignments rather than any internal equivalence of inputs and outputs.
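
The keyword-based mapping step in this pipeline can be illustrated with a simple overlap rule, which also makes the sensitivity to the matching threshold easy to inspect. Everything here is hypothetical: the topic keywords, the survey item names and cue words, and the `min_overlap` parameter are invented for the example and are not the paper's mappings.

```python
def map_topics_to_items(topic_keywords, item_keywords, min_overlap=2):
    """Map each topic to the survey items sharing >= min_overlap keywords."""
    mappings = {}
    for topic, words in topic_keywords.items():
        mappings[topic] = [
            item for item, cues in item_keywords.items()
            if len(set(words) & set(cues)) >= min_overlap
        ]
    return mappings


def count_mapped(mappings):
    """Number of topics that matched at least one survey item."""
    return sum(1 for items in mappings.values() if items)


# Hypothetical topics (top keywords) and survey items (cue words).
topics = {
    "topic_a": ["cervical", "cancer", "screening", "pap", "test"],
    "topic_b": ["vaccine", "shot", "doctor", "recommend", "daughter"],
    "topic_c": ["school", "mandate", "law", "state", "bill"],
}
items = {
    "heard_of_hpv": ["heard", "hpv", "virus"],
    "hpv_causes_cervical_cancer": ["hpv", "cause", "cervical", "cancer"],
    "doctor_recommended_vaccine": ["doctor", "recommend", "vaccine", "shot"],
}

# Sensitivity check: how many topics stay mapped as the threshold rises?
for threshold in (1, 2, 3):
    mapped = count_mapped(map_topics_to_items(topics, items, threshold))
    print(f"min_overlap={threshold}: {mapped} of {len(topics)} topics mapped")
```

Reporting the mapped count at several thresholds is one inexpensive way to show whether a figure such as 35 or 112 mapped topics is stable or hinges on a single cutoff.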

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the accuracy of the rule-based tweet classifier and the assumption that Twitter users and geocoded content represent broader US health behavior patterns; topic modeling hyperparameters and the choice of 122 topics are not detailed.

free parameters (1)

number of topics
122 topics selected via topic modeling to discover major themes; choice affects downstream correlations with HINTS.

axioms (2)

domain assumption: Rule-based model accurately distinguishes promotional information from consumers' discussions
Invoked in methods to create the two tweet categories used for all subsequent topic modeling and correlation analysis.
domain assumption: Geocoded tweets are representative of US population health behaviors
Required for geographic distribution comparisons to HINTS responses.
