arxiv: 2604.23040 · v1 · submitted 2026-04-24 · 💻 cs.HC · cs.CY

Within-person prediction of depressive symptom change using year-long Screenome data and CES-D assessments

Pith reviewed 2026-05-08 10:25 UTC · model grok-4.3

classification 💻 cs.HC cs.CY
keywords depressive symptoms · within-person prediction · digital phenotyping · CES-D · machine learning · smartphone screenshots · symptom trajectories · behavioral features

The pith

Smartphone screenshots enable prediction of within-person depressive symptom changes over the next two weeks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper attempts to establish that a year of continuously collected smartphone screenshots, combined with fortnightly depression assessments, supports accurate forecasts of whether an individual's symptoms will worsen, stay the same, or improve in the following fortnight. A sympathetic reader would care because this could support monitoring systems that flag potential deterioration early enough for targeted care before symptoms reach crisis levels. Under temporal holdout the models reach an AUC of 0.906 for severity-band crossings and 0.755 for change relative to a person's own variability, and they reach 0.821 on unseen participants. A person's typical symptom level proves essential for detecting worsening, while behavioral patterns such as rising social media use often precede changes. The work frames the task as within-person classification to handle individual differences and provides a proof-of-concept for passive data in digital phenotyping.

Core claim

By training XGBoost models on over 100 million screenshots and CES-D scores from 96 adults followed for one year, the authors show that symptom change over the subsequent fortnight can be classified with an AUC of 0.906 for crossings of established severity bands and 0.755 for change relative to each person's own variability, generalizing to unseen individuals at an AUC of 0.821, with each person's typical symptom level as the only statistically significant predictor beyond the most recent score and with Screenome-derived features revealing prodromal patterns including escalating social media use, fragmented device engagement, and shifts in overnight activity.
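Two of the label definitions named in the core claim can be pictured with the minimal sketch below. It is an illustration only: the band cutoffs (16 and 24), the one-standard-deviation threshold, and all function names are assumptions, since this summary does not state the cutoffs or thresholds the authors used. Higher CES-D scores indicate more symptoms, so an upward crossing is labeled as worsening.

```python
from bisect import bisect_right
from statistics import pstdev

# Hypothetical severity-band cutoffs on the 0-60 CES-D scale (assumption:
# the summary does not list the bands the authors used).
BAND_CUTOFFS = [16, 24]


def band(score):
    """Return the index of the severity band that a score falls into."""
    return bisect_right(BAND_CUTOFFS, score)


def label_by_band(current, following):
    """Three-class label from a crossing between severity bands."""
    if band(following) > band(current):
        return "worsening"
    if band(following) < band(current):
        return "improving"
    return "stable"


def label_by_variability(history, current, following, k=1.0):
    """Three-class label relative to the person's own past variability.

    `history` holds that person's earlier scores; `k` is a hypothetical
    multiplier on their standard deviation.
    """
    threshold = k * pstdev(history)
    delta = following - current
    if delta > threshold:
        return "worsening"
    if delta < -threshold:
        return "improving"
    return "stable"
```

Under these assumed cutoffs, a move from 14 to 18 counts as worsening, while a move from 18 to 20 stays within one band and counts as stable.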

What carries the argument

Within-person XGBoost classification under temporal holdout, using Screenome-derived behavioral features from screenshots together with CES-D scores as predictors of upcoming symptom change operationalized in three clinically meaningful ways.
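A temporal holdout of the kind described here could look like the sketch below. It assumes a per-participant chronological cut in which each person's latest assessments form the test set; the summary does not say whether the authors cut per person or at a single calendar date, and the row layout `(participant_id, wave, payload)` and the 20% fraction are hypothetical.

```python
from collections import defaultdict


def temporal_holdout(rows, test_fraction=0.2):
    """Split (participant_id, wave, payload) rows chronologically per person.

    The last `test_fraction` of each participant's waves go to the test
    set, so every test row is later than that person's training rows.
    """
    by_person = defaultdict(list)
    for row in rows:
        by_person[row[0]].append(row)

    train, test = [], []
    for person_rows in by_person.values():
        person_rows.sort(key=lambda r: r[1])  # order by assessment wave
        n_test = max(1, round(len(person_rows) * test_fraction))
        cut = len(person_rows) - n_test
        train.extend(person_rows[:cut])
        test.extend(person_rows[cut:])
    return train, test
```

The point of the design is that any feature built for a test row may only draw on rows that precede it in time, which is the property the referee's major comment probes.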

If this is right

  • Without each person's typical symptom level as a predictor, the models miss the most consequential worsening transitions.
  • Screenome features show prodromal patterns of worsening such as escalating social media use, fragmented device engagement, and changes in overnight activity.
  • The predictive models generalize to individuals not included in the training data.
  • The results establish a foundation for monitoring systems that could identify individuals approaching clinical deterioration.
  • Substantial individual heterogeneity exists in the behavioral signals linked to symptom change.
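The generalization bullet above implies a split made by participant rather than by time. A minimal sketch, assuming the same hypothetical `(participant_id, wave, payload)` rows and an illustrative 20% of participants held out (the summary does not give the authors' actual scheme):

```python
import random


def participant_holdout(rows, test_fraction=0.2, seed=0):
    """Hold out whole participants so none appears in both sets."""
    ids = sorted({row[0] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [row for row in rows if row[0] not in test_ids]
    test = [row for row in rows if row[0] in test_ids]
    return train, test
```

Because no held-out person contributes training rows, a score on this split reflects transfer to new individuals rather than familiarity with a person's earlier data.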

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration into smartphone apps could enable proactive alerts based on passive data without requiring extra user input.
  • A controlled trial testing whether intervening on these predictions reduces actual symptom worsening would be a direct next test of clinical utility.
  • The same passive screenshot approach might be applied to forecast trajectories in other mental health or behavioral domains.
  • Replication across more diverse populations would clarify whether the three ways of defining clinically meaningful change hold broadly.

Load-bearing premise

That the behavioral features extracted from screenshots capture signals predictively linked to upcoming symptom changes rather than merely reflecting correlations within this particular sample.

What would settle it

A replication study in an independent sample where adding the Screenome behavioral features produces no improvement in prediction accuracy over models that use only the most recent CES-D score and the participant's typical symptom level would show that the screenshot data adds no value for early detection.

Original abstract

Predicting whether an individual's depressive symptoms will worsen, remain stable, or improve over the coming weeks can enable earlier and more targeted care, yet prospective within-person trajectory prediction remains largely unaddressed in digital phenotyping. We combine fortnightly CES-D assessments with over 100 million screenshots captured every five seconds via the Stanford Screenomics platform from 96 adults followed for approximately one year (M = 20.9, SD = 3.9 assessments per participant, 2,002 total observations). We frame prediction as a within-person classification task: whether symptoms will worsen, remain stable, or improve over the subsequent fortnight, operationalized in three ways to capture clinically meaningful change. Under temporal holdout, XGBoost achieves an AUC of 0.906 for crossings of established CES-D severity bands and 0.755 for change relative to each participant's own within-person variability, generalizing to unseen individuals (AUC = 0.821). Each person's typical symptom level was the only statistically significant predictor above the most recent CES-D score; without it, the most consequential worsening transitions go undetected. Screenome-derived behavioral features revealed prodromal patterns of worsening, including escalating social media use, fragmented device engagement, and changes in overnight activity, with substantial individual heterogeneity. These findings establish a proof-of-concept foundation for monitoring systems that could identify individuals approaching clinical deterioration before symptoms reach a crisis point.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript combines fortnightly CES-D assessments with Screenome data (over 100 million screenshots) from 96 adults followed for about one year to predict within-person change in depressive symptoms (worsening, stable, or improving) over the next fortnight. Using XGBoost under temporal holdout, it reports an AUC of 0.906 for crossings of CES-D severity bands and 0.755 for change relative to each participant's within-person variability, with an AUC of 0.821 when generalizing to unseen individuals. Each person's typical symptom level is identified as the only statistically significant predictor beyond the most recent CES-D score, and Screenome features show prodromal patterns such as escalating social media use.

Significance. If the temporal holdout is implemented without data leakage and the typical symptom level is computed prospectively, this work provides a valuable proof-of-concept for using passive digital phenotyping to forecast depressive symptom trajectories, potentially enabling earlier interventions. The high AUCs and emphasis on individual heterogeneity strengthen the case for personalized monitoring systems in mental health.

major comments (1)
  1. Methods section on feature construction and temporal holdout: The definition and computation of 'each person's typical symptom level' must be clarified. The abstract and results claim it is the only statistically significant predictor, but if this feature is the grand mean across the entire study period (including future observations), it violates the temporal holdout by leaking future CES-D data into predictions at time t. This would inflate the reported AUCs (0.906 and 0.755) and explain the performance drop without it. Specify the exact formula and ensure it uses only historical data up to the prediction point.
minor comments (2)
  1. Abstract: The abstract mentions 'temporal holdout' but does not detail how missing data, hyperparameter tuning, or the three operationalizations of change are handled; adding brief specifics would improve clarity.
  2. Results: Add detail on the statistical tests used to establish that the typical symptom level is a 'statistically significant predictor', and report effect sizes or confidence intervals for the AUC values.
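The request for confidence intervals on the AUC values could be met in several ways; one common option is a percentile bootstrap, sketched below. This is not the authors' procedure. It assumes a binary framing (for example, worsening versus everything else), whereas the paper's labels are three-class, and it resamples individual observations. Since observations are nested within participants, resampling whole participants would likely be the more defensible choice. All names and defaults are hypothetical.

```python
import random


def auc(labels, scores):
    """Binary AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC, resampling observations."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # need both classes to compute AUC
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Calling `bootstrap_auc_ci(labels, scores)` on held-out predictions returns a lower and upper bound that can be reported next to the point estimate.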

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying the need for greater precision in our description of the typical symptom level feature. We address this point directly below and will revise the manuscript to eliminate any ambiguity regarding temporal holdout compliance.

Point-by-point responses
  1. Referee: Methods section on feature construction and temporal holdout: The definition and computation of 'each person's typical symptom level' must be clarified. The abstract and results claim it is the only statistically significant predictor, but if this feature is the grand mean across the entire study period (including future observations), it violates the temporal holdout by leaking future CES-D data into predictions at time t. This would inflate the reported AUCs (0.906 and 0.755) and explain the performance drop without it. Specify the exact formula and ensure it uses only historical data up to the prediction point.

    Authors: We agree that explicit clarification is required. In the submitted manuscript the typical symptom level was computed prospectively as the mean of all CES-D scores observed for that participant up to and including the current assessment (i.e., the cumulative mean available at time t). This construction uses only historical data and therefore respects the temporal holdout. However, the Methods section did not state the formula or the prospective nature of the calculation, which understandably raises the concern the referee has identified. We will revise the manuscript to (1) provide the exact formula typical_symptom_level_t = (1/t) * sum(CES-D_1 to CES-D_t), (2) confirm that no future observations are included, and (3) report a sensitivity check showing that performance remains comparable when the feature is instead defined using only data up to t-1. These changes will appear in the Methods, Results, and supplementary materials.

    revision: yes
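The formula in the simulated rebuttal, and the t-1 variant proposed as a sensitivity check, can be sketched as follows. The function name and return layout are hypothetical; only the two definitions come from the rebuttal text. Neither list ever reads a score later than the wave it is attached to, which is the property the referee asked the authors to confirm.

```python
def cumulative_mean_features(scores):
    """Per-wave typical-level features for one participant's CES-D series.

    Returns two lists aligned with `scores`:
    - inclusive[t]: mean of scores[0..t], matching the rebuttal's formula
    - lagged[t]: mean of scores[0..t-1], the proposed sensitivity variant
      (None at the first wave, where no earlier score exists)
    """
    inclusive, lagged = [], []
    running_total = 0.0
    for t, score in enumerate(scores):
        lagged.append(running_total / t if t > 0 else None)
        running_total += score
        inclusive.append(running_total / (t + 1))
    return inclusive, lagged
```

For a series of 10, 20, 30, the inclusive feature is 10, 15, 20, whereas a grand mean over the whole study would assign 20 to every wave, including the first, and so would leak later scores into earlier predictions.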

Circularity Check

0 steps flagged

No significant circularity in derivation or prediction pipeline

Full rationale

The paper frames symptom-change prediction as a standard supervised classification task (XGBoost on Screenome features plus CES-D history) under explicit temporal holdout, with performance reported via AUC on held-out fortnights and generalization to unseen participants. No load-bearing step reduces a claimed prediction to its own inputs by construction: the 'typical symptom level' feature is presented as an empirical predictor without evidence in the text that it is computed from future data or fitted to the target labels in a way that forces the reported AUCs. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core modeling choices. The derivation chain remains self-contained against external benchmarks (temporal holdout and cross-individual generalization), consistent with ordinary applied ML practice in digital phenotyping.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard ML assumptions; full paper would be needed to audit feature definitions and change thresholds.

Code and supplementary materials

GitHub repository: https://github.com/mediacontentatlas/within-person-cesd-screenome

Supplementary Materials, Table S1 | Classification performance across all four models and all three label operationalizations (held-out test set, N = 411). All three labels are three-class (improving, stable, worsening); AU...