arxiv: 1506.02275 · v2 · pith:SHRQ7W4Lnew · submitted 2015-06-07 · 💻 cs.CL

Confounds and Consequences in Geotagged Twitter Data

classification 💻 cs.CL
keywords: data, differences, gender, geolocation, linguistic, studies, text-based, twitter

Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40.

