The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions

Anneke Buffone; Daniel Preotiuc-Pietro; Daniel Rieman; H. Andrew Schwartz; Lyle H. Ungar; Salvatore Giorgi

arxiv: 1808.09600 · v1 · pith:FYK2JFFHnew · submitted 2018-08-29 · 💻 cs.SI · cs.CY

Salvatore Giorgi, Daniel Preotiuc-Pietro, Anneke Buffone, Daniel Rieman, Lyle H. Ungar, H. Andrew Schwartz

keywords: aggregated · community-level · outcomes · prediction · billion · data · predictions · tweets

Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes typically concern people, but the data is often aggregated without regard to the individual users in each community's Twitter population. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated by user. Results on four different U.S. county-level tasks, spanning demographic, health, and psychological outcomes, show large and consistent improvements in prediction accuracy (e.g., from Pearson r=.73 to .82 for median income prediction, or from r=.37 to .47 for life satisfaction prediction) over the standard approach of aggregating all tweets. We make our aggregated and anonymized community-level data, derived from 37 billion tweets (over 1 billion of which were mapped to counties), available for research.
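The difference between the two aggregation strategies can be illustrated with a small sketch. The abstract does not spell out the feature pipeline, so everything below is an assumption made for illustration: the `(county, user, text)` tuple layout, whitespace tokenization, relative word frequencies as features, and the function names `tweet_level_features` and `user_level_features` are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def _normalize(counter):
    """Turn raw word counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {word: n / total for word, n in counter.items()}


def tweet_level_features(tweets):
    """Baseline: pool every tweet in a county into one bag of words."""
    counts = defaultdict(Counter)
    for county, _user, text in tweets:
        counts[county].update(text.lower().split())
    return {county: _normalize(c) for county, c in counts.items()}


def user_level_features(tweets):
    """Assumed user-level variant: normalize per user, then average users."""
    per_user = defaultdict(Counter)
    for county, user, text in tweets:
        per_user[(county, user)].update(text.lower().split())

    sums = defaultdict(lambda: defaultdict(float))
    n_users = Counter()
    for (county, _user), counts in per_user.items():
        n_users[county] += 1
        for word, freq in _normalize(counts).items():
            sums[county][word] += freq

    # Each user contributes equally, regardless of how much they post.
    return {
        county: {word: s / n_users[county] for word, s in words.items()}
        for county, words in sums.items()
    }


# Toy data: one prolific user (u1) and two occasional users (u2, u3).
sample = [
    ("A", "u1", "coffee coffee coffee coffee"),
    ("A", "u1", "coffee coffee coffee coffee"),
    ("A", "u2", "tea"),
    ("A", "u3", "tea"),
]
by_tweet = tweet_level_features(sample)
by_user = user_level_features(sample)
```

On this toy input, pooling all tweets lets the single prolific user dominate the county (relative frequency of "coffee" is 0.8), whereas averaging per-user frequencies weights the three users equally ("coffee" drops to one third and "tea" rises to two thirds). This is only meant to show why the two aggregations can yield different county-level features, not to reproduce the reported results.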
