arxiv: 2604.24611 · v1 · submitted 2026-04-27 · 💻 cs.LG

Uncovering Latent Patterns in Social Media Usage and Mental Health: A Clustering-Based Approach Using Unsupervised Machine Learning

Pith reviewed 2026-05-08 04:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords social media usage, mental health, K-Means clustering, unsupervised machine learning, anxiety, depression, user segmentation, survey data

The pith

K-Means clustering applied to 551 survey responses uncovers six distinct patterns linking social media use to mental health indicators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that unsupervised clustering can segment users into groups based on their social media habits and mental health measures such as anxiety, depression, loneliness, and sleep quality. Using data from an online survey of 551 participants, the authors impute missing values, encode categorical variables, and then apply K-Means, selecting six clusters. The claim is that this segmentation reveals structure that pairwise correlations, such as the one between time spent on social media and anxiety scores, cannot show on their own. A reader would care if these groups correspond to different risk levels that could inform personalized advice or further research on digital well-being.

Core claim

The study claims that K-Means clustering applied to preprocessed survey data, with the number of clusters set to six based on the Elbow Method and a Silhouette Score of 0.32, identifies latent patterns in social media usage and mental health. PCA is used to visualize the clusters, and a separate correlation heatmap reports relationships such as a 0.28 correlation between social media hours and anxiety.
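
For scale, the two headline numbers can be read against their standard definitions. The block below is a reading aid, not the paper's derivation. It gives the usual silhouette definition and, assuming the reported 0.28 is a Pearson coefficient (the source does not say which coefficient was used), the share of variance that such a correlation corresponds to.

```latex
% Silhouette of sample i: a(i) is the mean distance to points in its own cluster,
% b(i) is the mean distance to points in the nearest other cluster.
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1,
\qquad S = \frac{1}{n} \sum_{i=1}^{n} s(i)

% Assuming a Pearson correlation r = 0.28 between social media hours and anxiety:
r^2 = 0.28^2 = 0.0784 \approx 7.8\%
```

Under that assumption, social media hours would account linearly for roughly 8% of the variance in the anxiety score.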

What carries the argument

K-Means clustering optimized via Elbow Method and Silhouette Score on preprocessed survey data, with PCA for visualization.
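
A minimal sketch of how such a selection of K is commonly run with scikit-learn. The data here is a synthetic stand-in with the same shape as the survey matrix (551 rows, 22 features), and the helper name `scan_k`, the seed, and `n_init=10` are illustrative assumptions, not the authors' settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption): NOT the paper's survey responses.
X, _ = make_blobs(n_samples=551, n_features=22, centers=6,
                  cluster_std=4.0, random_state=0)
X = StandardScaler().fit_transform(X)


def scan_k(X, k_values, seed=0):
    """Return {k: (inertia, silhouette)} for each candidate cluster count."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # inertia_ is the within-cluster sum of squares plotted in an elbow chart
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


scores = scan_k(X, range(2, 11))
best_k = max(scores, key=lambda k: scores[k][1])

for k, (inertia, sil) in scores.items():
    print(f"k={k:2d}  inertia={inertia:10.1f}  silhouette={sil:.3f}")
print("k with the highest silhouette:", best_k)
```

Printing the full table rather than only the winning K lets a reader see how flat the silhouette curve is around the chosen value.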

If this is right

  • The six clusters can be interpreted as different user profiles with varying mental health implications from social media use.
  • Correlation analysis supports targeted examination of relationships within each cluster.
  • Visualization with PCA allows for better understanding of how the clusters separate in reduced dimensions.
  • This clustering approach offers a method to move from general associations to specific segmented insights in mental health studies.
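
The PCA visualization mentioned in the list above can be sketched as follows. Random placeholder data stands in for the 22 survey features, so the retained-variance figure it prints says nothing about the paper's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder matrix (assumption): 551 rows x 22 features of random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(551, 22))

X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components for a 2D scatter plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

# Fraction of total variance the 2D view keeps; worth reporting next to the plot.
retained = float(pca.explained_variance_ratio_.sum())
print(f"variance retained by 2 components: {retained:.1%}")
```

If the two components retain only a small share of the variance, apparent overlap or separation in the 2D plot may not reflect the geometry in the full feature space.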

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If these clusters prove stable, the method could be tested on larger or more diverse populations to see if the same six profiles appear.
  • Future work might track individuals over time to determine if cluster assignment predicts mental health changes.
  • The findings could imply that public health messages about social media should consider different user types rather than a one-size-fits-all approach.

Load-bearing premise

The survey responses from 551 participants truly represent actual social media behaviors and mental health states, and the six clusters are real patterns rather than results of the particular data cleaning or sample chosen.

What would settle it

Apply the same clustering procedure to a new, independent sample of similar participants. If the optimal cluster count is no longer six, or the Silhouette Score is much lower, the patterns do not generalize beyond the original sample.

Figures

Figures reproduced from arXiv: 2604.24611 by Mahfuza Khatun, Md All Shahria, Mohammad Sakib Mahmood, Sanjeda Dewan Mithila, Touhid Alam.

Figure 1: Methodology of the proposed study.

Figure 2: Elbow Method and Silhouette Score for Optimal K. The Elbow Method shows diminishing returns after K=5, while the Silhouette Score peaks at K=6, justifying the selection of six clusters.

Figure 3: Distribution of Social Media Hours Categories within Each Cluster.

Figure 4: PCA Visualization of Clusters in 2D Space. The plot shows the separation of the six clusters along the first two principal components.

Figure 5: Heatmap of Cluster Centers. Each cell represents the mean scaled value (0-1) of a feature (row) for a given cluster (column 0-5). Brighter colors indicate higher mean values. This visually profiles the defining characteristics of each user segment.

Original abstract

The widespread adoption of social media has heightened interest in its psychological effects, particularly on mental health indicators such as anxiety, depression, loneliness, and sleep quality, as these platforms increasingly influence social interactions and well-being. Although previous research has examined correlations between social media use and mental health, few studies have utilized unsupervised machine learning to segment users based on behavioral and psychological patterns, leaving a gap in identifying distinct risk profiles across diverse groups. This study seeks to address this by segmenting individuals according to their social media usage and psychological well-being, employing clustering to reveal hidden patterns and evaluate their mental health implications. Data from 551 participants, collected via an online survey, were preprocessed using KNN imputation for missing values, one-hot encoding for categorical variables like Gender with 5 unique values, and outlier detection via IQR and Z-score methods. K-Means clustering, optimized at 6 clusters using the Elbow Method and a Silhouette Score of 0.32, was applied, with PCA reducing 22 dimensions for visualization and a correlation heatmap highlighting relationships, such as a 0.28 correlation between social media hours and anxiety.
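
The preprocessing steps named in the abstract can be sketched with scikit-learn as below. The toy values, the column meanings, `n_neighbors=2`, the 1.5 IQR multiplier, and the z-score threshold of 3 are all assumptions for illustration. The abstract does not state the parameters the authors used.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumption): columns are daily hours and an anxiety score.
numeric = np.array([[1.0, 2], [2.5, 3], [np.nan, 4],
                    [6.0, np.nan], [3.0, 1], [4.5, 5]])
gender = np.array([["F"], ["M"], ["F"], ["Nonbinary"], ["M"], ["F"]])

# 1) Fill missing numeric values from the nearest rows.
numeric_filled = KNNImputer(n_neighbors=2).fit_transform(numeric)


# 2) Flag outliers with two common rules (thresholds are conventional defaults).
def iqr_outliers(x, k=1.5):
    """True where x lies more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def zscore_outliers(x, threshold=3.0):
    """True where |x - mean| exceeds threshold standard deviations."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold


hours_flags = iqr_outliers(numeric_filled[:, 0])

# 3) Standardize numeric columns and one-hot encode the categorical one.
numeric_scaled = StandardScaler().fit_transform(numeric_filled)
gender_onehot = OneHotEncoder().fit_transform(gender).toarray()

X = np.hstack([numeric_scaled, gender_onehot])
print(X.shape)
```

The two outlier rules need not agree on the same points, so the choice of rule and threshold is itself a preprocessing decision that can affect the clusters.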

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper applies K-Means clustering to preprocessed survey data from 551 participants to segment users by social media usage and mental health variables (anxiety, depression, loneliness, sleep quality). It uses KNN imputation, one-hot encoding, IQR/Z-score outlier handling, selects k=6 via Elbow method plus Silhouette score of 0.32, reduces dimensions with PCA for visualization, and reports correlations such as 0.28 between social media hours and anxiety, claiming to uncover latent risk profiles.

Significance. If the clusters prove stable and interpretable beyond the specific sample, the work could help identify heterogeneous patterns linking social media behavior to mental health, supporting more targeted public-health insights. The use of unsupervised methods on combined behavioral and psychological features is a reasonable exploratory step, though the low separation metric limits immediate impact.

major comments (3)
  1. [Clustering results] Clustering results (K-Means section): The Silhouette score of 0.32 for the chosen 6-cluster solution lies at the lower end of the 'fair' range and indicates substantial overlap; without reported Silhouette scores for k=5 and k=7, bootstrap stability, or comparison to GMM or hierarchical clustering, it is unclear whether the solution captures distinct latent patterns or is sensitive to the 551-participant sample and preprocessing choices.
  2. [Methods] Methods (preprocessing and validation): No external validation, cross-validation, or stability checks (e.g., adjusted Rand index across random seeds or subsamples) are reported for the cluster assignments, making the claim that the groups represent meaningful 'risk profiles' rest on a single run of K-Means after KNN imputation and IQR/Z-score cleaning whose parameters are not fully specified.
  3. [Results] Results (interpretation): The manuscript interprets the 6 clusters as latent risk profiles without providing cluster-wise statistics on the original variables or testing whether the groupings predict external outcomes (e.g., clinical thresholds for anxiety), so the central claim of uncovering hidden patterns lacks a falsifiable link to mental-health implications.
minor comments (3)
  1. [Abstract and Methods] The abstract and methods should explicitly list the 22 features used before PCA and state the exact distance metric and initialization for K-Means.
  2. [Figures] Figure captions for the PCA visualization and correlation heatmap should include axis labels, explained variance ratios, and the exact correlation coefficient values rather than a single example (0.28).
  3. [Discussion] Add a brief discussion of potential selection bias in the online survey sample and how the 551 participants compare demographically to broader populations.
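
The comparison requested in major comment 1 could look roughly like the sketch below: silhouette scores at k=5, 6, and 7 for K-Means, Ward hierarchical clustering, and a Gaussian mixture. The data is synthetic and the helper `labels_for` is hypothetical. It shows the shape of the check, not results for the paper.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption): NOT the paper's survey responses.
X, _ = make_blobs(n_samples=551, n_features=22, centers=6,
                  cluster_std=4.0, random_state=0)
X = StandardScaler().fit_transform(X)


def labels_for(name, k, X, seed=0):
    """Fit one of three clustering methods and return its hard labels."""
    if name == "kmeans":
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    if name == "ward":
        return AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    if name == "gmm":
        return GaussianMixture(n_components=k, covariance_type="diag",
                               random_state=seed).fit_predict(X)
    raise ValueError(f"unknown method: {name}")


methods = ("kmeans", "ward", "gmm")
table = {(m, k): silhouette_score(X, labels_for(m, k, X))
         for m in methods for k in (5, 6, 7)}

for (m, k), sil in table.items():
    print(f"{m:7s} k={k}  silhouette={sil:.3f}")
```

If k=6 is only marginally better than its neighbors, or the methods disagree on the best k, that would support the referee's concern about sensitivity.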

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and have revised the paper to incorporate additional analyses and clarifications where feasible.

Point-by-point responses
  1. Referee: [Clustering results] Clustering results (K-Means section): The Silhouette score of 0.32 for the chosen 6-cluster solution lies at the lower end of the 'fair' range and indicates substantial overlap; without reported Silhouette scores for k=5 and k=7, bootstrap stability, or comparison to GMM or hierarchical clustering, it is unclear whether the solution captures distinct latent patterns or is sensitive to the 551-participant sample and preprocessing choices.

    Authors: We acknowledge that a Silhouette score of 0.32 reflects only moderate separation, which is expected in noisy survey data combining behavioral and psychological measures. In the revised manuscript, we now report Silhouette scores for k=5 and k=7 to allow readers to evaluate the choice of k=6. We have added a stability analysis by repeating K-Means across multiple random seeds and report the consistency of assignments. We have also included a comparison to hierarchical clustering in the supplementary materials to show that the selected solution is robust for this dataset. revision: yes

  2. Referee: [Methods] Methods (preprocessing and validation): No external validation, cross-validation, or stability checks (e.g., adjusted Rand index across random seeds or subsamples) are reported for the cluster assignments, making the claim that the groups represent meaningful 'risk profiles' rest on a single run of K-Means after KNN imputation and IQR/Z-score cleaning whose parameters are not fully specified.

    Authors: We have updated the Methods section to fully specify all preprocessing parameters, including the number of neighbors for KNN imputation and the exact thresholds applied in IQR and Z-score outlier detection. We now include internal stability checks by running K-Means with varied random seeds and computing the adjusted Rand index to quantify consistency across runs. A subsampling analysis has also been added to assess cluster stability on data subsets. External validation is not possible with the existing survey data alone. revision: yes

  3. Referee: [Results] Results (interpretation): The manuscript interprets the 6 clusters as latent risk profiles without providing cluster-wise statistics on the original variables or testing whether the groupings predict external outcomes (e.g., clinical thresholds for anxiety), so the central claim of uncovering hidden patterns lacks a falsifiable link to mental-health implications.

    Authors: We have added a table in the Results section with cluster-wise means and standard deviations for all original variables, enabling direct inspection of the distinct profiles. This provides a clearer basis for interpreting the groups as risk profiles. However, the survey collected only continuous self-report scores and does not include clinical diagnostic thresholds or external outcome measures, so direct testing of predictive validity against clinical categories cannot be performed with the available data. The discussion has been expanded to better situate the clusters within existing mental health literature. revision: partial
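
The stability checks and cluster-wise statistics described in the responses above can be sketched as follows. The data is synthetic, and the function names, seed list, number of rounds, and 80% subsample fraction are assumptions, since the rebuttal does not give these details.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption): NOT the paper's survey responses.
X, _ = make_blobs(n_samples=551, n_features=22, centers=6,
                  cluster_std=4.0, random_state=0)
X = StandardScaler().fit_transform(X)


def seed_stability(X, k, seeds):
    """Pairwise adjusted Rand index between single-init runs with different seeds."""
    runs = [KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(X)
            for s in seeds]
    return [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]


def subsample_stability(X, k, n_rounds=10, frac=0.8, seed=0):
    """ARI between full-data labels and labels refit on random subsamples."""
    rng = np.random.default_rng(seed)
    reference = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx])
        scores.append(adjusted_rand_score(reference[idx], sub))
    return scores


def cluster_profile(X, labels):
    """Per-cluster (mean, std) of every feature, keyed by cluster id."""
    return {int(c): (X[labels == c].mean(axis=0), X[labels == c].std(axis=0))
            for c in np.unique(labels)}


seed_ari = seed_stability(X, k=6, seeds=[0, 1, 2, 3, 4])
sub_ari = subsample_stability(X, k=6)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
profile = cluster_profile(X, labels)

print(f"mean ARI across seeds:      {np.mean(seed_ari):.3f}")
print(f"mean ARI across subsamples: {np.mean(sub_ari):.3f}")
```

The adjusted Rand index is invariant to how cluster ids are numbered, so runs can be compared without aligning labels first. Values near 1 indicate that the partition is reproduced. Values well below 1 would suggest the six profiles depend on initialization or on which participants are included.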

standing simulated objections not resolved
  • The survey data does not include clinical thresholds or external outcome measures, so the clusters cannot be tested for prediction of clinical anxiety or similar binary outcomes.

Circularity Check

0 steps flagged

No circularity: standard unsupervised clustering applied directly to input data

Full rationale

The paper's core steps consist of preprocessing survey responses (KNN imputation, one-hot encoding, IQR/Z-score outlier removal) followed by K-Means clustering whose k=6 is selected via Elbow Method and Silhouette Score computed on the same data, plus PCA for visualization and a correlation matrix. None of these operations reduce by construction to previously fitted parameters, self-definitions, or self-citation chains; the cluster labels and scores are direct algorithmic outputs from the 551-participant feature matrix. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear. The derivation chain remains self-contained against the raw survey inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validity of self-reported survey data and the choice of clustering parameters without independent verification.

free parameters (1)
  • Number of clusters = 6
    Selected using the Elbow Method and Silhouette Score, both computed on the same data.
axioms (1)
  • domain assumption: The survey responses accurately reflect participants' social media usage and mental health states.
    Assumed for the clustering to be meaningful.

pith-pipeline@v0.9.0 · 5522 in / 1383 out tokens · 53465 ms · 2026-05-08T04:12:20.767286+00:00 · methodology

