arxiv: 1605.03481 · v2 · pith:MUD3CD2Wnew · submitted 2016-05-11 · 💻 cs.LG · cs.CL

Tweet2Vec: Character-Based Distributed Representations for Social Media

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, William W. Cohen

classification: cs.LG, cs.CL

keywords: character, tweet2vec, approaches, media, model, posts, representations, sequences



Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
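
The vocabulary problem described in the abstract can be illustrated with a small sketch. This is a toy example, not the tweet2vec model: the posts are invented, and the helper names (`word_vocab`, `char_vocab`, `encode_chars`) are hypothetical. It only shows that a new post with unusual spellings can contain out-of-vocabulary words while every one of its characters has already been seen, which is what makes a character sequence a usable input for an encoder.

```python
# Toy illustration only: these posts are made up, not data from the paper.
posts = [
    "this is soooo good",
    "this is sooooo goood",
    "dis is so gud lol",
    "this is so good!!!",
]


def word_vocab(texts):
    """Set of whitespace-delimited tokens seen in the texts."""
    return {token for text in texts for token in text.split()}


def char_vocab(texts):
    """Set of individual characters seen in the texts."""
    return {ch for text in texts for ch in text}


def encode_chars(text, index):
    """Map a post to a list of integer character ids."""
    return [index[ch] for ch in text]


words = word_vocab(posts)
chars = char_vocab(posts)
# Assign each known character a stable integer id.
char_index = {ch: i for i, ch in enumerate(sorted(chars))}

# A new post whose spellings differ from everything seen so far.
new_post = "sooo gooood"
unseen_words = [w for w in new_post.split() if w not in words]
unseen_chars = [c for c in new_post if c not in chars]

# The character-level encoding is still fully defined for the new post.
ids = encode_chars(new_post, char_index)

print("out-of-vocabulary words:", unseen_words)
print("out-of-vocabulary characters:", unseen_chars)
print("character ids:", ids)
```

In this sketch a word-level model would have to map both words of the new post to an unknown token, whereas the character ids could be fed to a sequence encoder unchanged. How tweet2vec composes those characters into a tweet vector is described in the paper itself and is not reproduced here.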
