arxiv: 1907.00725 · v1 · pith:LGM3UXOKnew · submitted 2019-06-26 · 💻 cs.SI

Social Media-based User Embedding: A Literature Review

Pith reviewed 2026-05-25 15:25 UTC · model grok-4.3

classification 💻 cs.SI
keywords user embedding · social media · representation learning · literature review · heterogeneous data · behavior modeling · trait prediction

The pith

Social media user embeddings learned from text and images support scalable models of human traits and behaviors when ground truth labels are scarce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reviews methods for creating low-dimensional embeddings that represent social media users by combining their posts, images, and other data. The central motivation is that these embeddings transfer knowledge from large unlabeled datasets to prediction tasks where collecting ground-truth labels for traits like personality or behavior is expensive. The survey examines typical techniques for building a single unified embedding from heterogeneous inputs. It closes by listing open problems and suggested research directions.

Core claim

Automated representation learning can produce low-dimensional user embeddings from heterogeneous social media data such as texts and images. These embeddings enable high-performance models of latent human traits and behaviors because abundant unlabeled data can substitute for costly ground-truth labels at large scale. The review covers standard methods that integrate multiple data modalities into one representation and identifies current issues along with future directions.
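As a rough illustration of what "integrating multiple data modalities into one representation" can mean, the sketch below fuses per-modality vectors by normalizing and concatenating them. This is only one simple fusion strategy, picked here for brevity; it is not claimed to be a method the survey covers. The function names, the view names `"text"` and `"image"`, and the toy numbers are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_views(views):
    """Concatenate L2-normalized single-view embeddings into one user vector.

    `views` maps a (hypothetical) view name to that view's embedding.
    Views are concatenated in sorted-name order so every user shares one layout.
    """
    fused = []
    for name in sorted(views):
        fused.extend(l2_normalize(views[name]))
    return fused


# Toy single-view embeddings for one user (made-up numbers).
user_views = {"text": [3.0, 4.0], "image": [0.0, 2.0, 0.0]}
unified = fuse_views(user_views)
```

Normalizing each view before concatenation keeps a view with large raw magnitudes from dominating distance computations on the fused vector. Learned fusion methods would replace the concatenation step with a trained projection.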

What carries the argument

Learning unified user embeddings from heterogeneous social media data sources such as text and images.

If this is right

  • Trait and behavior models can be built at larger scale by leveraging embeddings trained on abundant unlabeled social media data.
  • Combining multiple modalities such as text and images into one embedding improves downstream prediction performance.
  • Current embedding methods still face challenges around data heterogeneity, scalability, and evaluation that limit their reliability.
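To make the first point concrete, here is a minimal sketch of how pre-computed user embeddings might feed a trait model when only a handful of labels exist. It assumes the embeddings were already learned elsewhere from unlabeled data, and it uses a nearest-centroid classifier purely as a stand-in for whatever downstream model one would actually choose. All names, labels, and numbers are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(embeddings, labels):
    """Group labeled user embeddings by label and average each group."""
    by_label = {}
    for vec, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest to `vec`."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], vec))


# Four labeled users (toy embeddings and made-up trait labels).
labeled_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = ["high", "high", "low", "low"]

centroids = fit_centroids(labeled_embeddings, labels)
prediction = predict(centroids, [0.7, 0.3])  # embedding of an unlabeled user
```

The classifier itself has almost no parameters to fit, which is the point of the sketch: the representational work is assumed to have happened during embedding learning, so a few labeled users can be enough to place a decision boundary.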

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding approach could transfer to other domains where labeled examples of user attributes are expensive but public interaction data are plentiful.
  • Wider adoption might raise new questions about how to protect sensitive attributes that become encoded in the embeddings.
  • Extending the surveyed methods with explicit network structure from follower graphs could produce richer representations.
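The follower-graph extension in the last point could start from random walks over the graph, the sampling step used by DeepWalk-style network embedding methods. The sketch below only generates the walks; feeding them to a skip-gram model to obtain vectors is left out. The graph, the user names, and the function names are illustrative assumptions, not taken from the paper.

```python
import random


def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes from `start`, choosing a uniform random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1], [])
        if not neighbors:  # dead end: user follows no one
            break
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks


# Toy follower graph: each user maps to the accounts they follow.
follower_graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["alice", "bob"],
    "dave": [],
}
walks = generate_walks(follower_graph, walks_per_node=2, length=4)
```

Each walk can then be treated like a sentence whose "words" are user IDs, so users that co-occur on walks end up with similar vectors. Such network-derived vectors could be one more view alongside text and image embeddings.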

Load-bearing premise

The methods and issues covered in the review are representative of the main approaches in the literature without significant selection bias.

What would settle it

A systematic search that uncovers major user-embedding techniques or data-fusion strategies not discussed in the survey.

Figures

Figures reproduced from arXiv: 1907.00725 by Shimei Pan, Tao Ding.

Figure 1. The typical architecture of a system that employs automated user embedding for personal traits and behavior analysis. One or more types of user data are first extracted from a social media account. For each type of user data such as text or image, a set of latent user features is learned automatically via single-view user embedding (e.g., text-based user embedding and image-based user embedding). T…
Original abstract

Automated representation learning is behind many recent success stories in machine learning. It is often used to transfer knowledge learned from a large dataset (e.g., raw text) to tasks for which only a small number of training examples are available. In this paper, we review recent advance in learning to represent social media users in low-dimensional embeddings. The technology is critical for creating high performance social media-based human traits and behavior models since the ground truth for assessing latent human traits and behavior is often expensive to acquire at a large scale. In this survey, we review typical methods for learning a unified user embeddings from heterogeneous user data (e.g., combines social media texts with images to learn a unified user representation). Finally we point out some current issues and future directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper is a literature review surveying recent methods for learning unified low-dimensional embeddings of social media users from heterogeneous data (e.g., text combined with images). It positions such embeddings as critical for downstream models of human traits and behavior, reviews 'typical methods,' and concludes by identifying current issues and future directions.

Significance. A representative survey of user-embedding techniques for social media would be useful for researchers working on transfer learning and behavioral modeling where labeled data is scarce. The central claim that the reviewed set captures the main approaches, however, cannot be evaluated without evidence of systematic selection.

major comments (1)
  1. [Abstract / Introduction] The manuscript states that it reviews 'typical methods' for unified user embeddings (abstract and introduction) but contains no methods section, no search strategy, no list of databases or keywords, no date range, and no inclusion/exclusion criteria. This directly undermines the representativeness claim required for the survey's central contribution.
minor comments (1)
  1. [Abstract] Abstract contains grammatical issues ('recent advance' should be 'advances'; 'unified user embeddings' is used inconsistently with singular/plural forms elsewhere).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The observation regarding the lack of explicit methodology is valid and we will address it by adding a dedicated section on the review process in the revised manuscript. This will clarify the scope without altering the narrative nature of the survey.
  1. Referee: [Abstract / Introduction] The manuscript states that it reviews 'typical methods' for unified user embeddings (abstract and introduction) but contains no methods section, no search strategy, no list of databases or keywords, no date range, and no inclusion/exclusion criteria. This directly undermines the representativeness claim required for the survey's central contribution.

    Authors: We agree that transparency in selection is important. Although the review focuses on typical methods drawn from prominent recent works rather than asserting exhaustive coverage, we will revise by inserting a new 'Review Methodology' subsection. It will specify the primary sources (Google Scholar, arXiv, ACL Anthology), search terms (combinations of 'user embedding', 'social media', 'multimodal representation', 'heterogeneous data'), approximate time frame (2014-2019), and inclusion criteria (methods producing unified low-dimensional user vectors from multiple modalities). This addition will enable readers to evaluate scope while preserving the paper's emphasis on representative techniques and open issues.

    revision: yes

Circularity Check

0 steps flagged

Survey paper with no derivations or load-bearing claims exhibits no circularity

This is a literature review summarizing existing methods for user embeddings from social media data. It presents no original mathematical derivations, predictions, first-principles results, or equations that could reduce to inputs by construction. No self-citations function as load-bearing justifications for any claimed uniqueness or ansatz, and the review structure contains no fitted parameters renamed as predictions. The absence of a methods section for paper selection is a methodological limitation but does not create circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature review with no new mathematical models, parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5644 in / 864 out tokens · 20549 ms · 2026-05-25T15:25:55.085176+00:00 · methodology
