Quantifying the human visual exposome with vision language models
Pith reviewed 2026-05-07 04:07 UTC · model grok-4.3
The pith
Vision language models can quantify the visual exposome from everyday photos and link it to mental health outcomes like stress and mood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 2674 participant-generated photographs, VLM-derived greenness estimates robustly predicted momentary affect and chronic stress, matching established benchmarks. A semi-autonomous LLM pipeline mined over seven million publications to derive nearly 1000 environmental features linked to mental health; when these features were rated from real-world imagery, up to 33 percent of the VLM context ratings correlated significantly with affect and stress measures. This work sets out a scalable, objective approach to visual exposomics that decodes associations between the visible world and mental health.
What carries the argument
Vision language models applied to participant photographs to generate semantic ratings of visual context, augmented by a semi-autonomous LLM pipeline that extracts empirically linked environmental features from scientific literature.
If this is right
- Greenness visible in daily scenes can be treated as a predictor of immediate emotional state and chronic stress.
- Nearly 1000 literature-derived environmental features become assessable from ordinary images for their mental health relevance.
- High-throughput objective analysis of personal visual environments replaces reliance on geospatial proxies or self-reports.
- A scalable paradigm emerges for linking specific visible features to mental health across large numbers of people and settings.
Where Pith is reading between the lines
- Phone-based photo analysis could support personal tools that flag environments associated with better or worse mood.
- The approach could extend to other outcomes such as physical activity levels or sleep quality by applying the same feature ratings.
- Urban planning might use quantified visual metrics from many users to evaluate and improve public spaces for psychological effects.
Load-bearing premise
Participant photographs accurately capture first-person daily visual context and VLM-derived semantic ratings validly reflect environmental features that are relevant to mental health outcomes.
What would settle it
A replication study in which VLM greenness ratings from new participant photos show no correlation with independent greenness measures or with affect and stress scores would falsify the central claim.
Original abstract
The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous large language model (LLM) based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33 percent of VLM-extracted context ratings significantly correlated with affect and stress. These findings establish a scalable objective paradigm for visual exposomics, enabling high-throughput decoding of how the visible world is associated with mental health.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a pipeline coupling ecological momentary assessment with vision-language models (VLMs) to quantify semantic features of the visual exposome from 2674 participant-generated photographs. VLM-derived greenness scores are shown to robustly predict momentary affect and chronic stress in line with prior benchmarks. An LLM-based literature-mining pipeline extracts nearly 1000 environmental features previously linked to mental health; when these features are rated by VLMs on the same images, up to 33% of the ratings correlate significantly with affect and stress measures. The work positions this approach as a scalable, objective paradigm for visual exposomics.
Significance. If the reported associations survive appropriate statistical controls, the study offers a genuinely new, high-throughput route to operationalize the visual component of the exposome. The greenness result is anchored to external benchmarks and therefore provides a credible proof-of-concept; the broader feature set, if validated, would enable systematic discovery of visual correlates of mental health without reliance on coarse geospatial or self-report data. The combination of VLM image understanding with LLM-driven literature synthesis is technically timely and could be adopted by exposome and environmental-psychology researchers.
major comments (3)
- [Results and Methods (statistical analysis)] Results (paragraph reporting the 33% figure) and Methods (statistical analysis subsection): the manuscript states that up to 33% of the ~1000 VLM-derived context ratings showed significant correlations with affect and stress at a nominal p<0.05 threshold. No mention is made of family-wise error rate control, false-discovery-rate adjustment, or permutation testing across the full feature set. Under the global null, ~50 false positives are expected; without correction the 33% figure is consistent with noise and does not support the claim that the pipeline yields a scalable set of replicable visual exposome markers. Greenness is a single pre-specified test and is therefore exempt, but the multi-feature claim is load-bearing for the central paradigm argument.
- [Methods (VLM rating and validation)] Methods (VLM rating and validation subsection): the paper provides no quantitative validation of VLM semantic ratings against human raters for the 1000 literature-derived features (only greenness is benchmarked). Without inter-rater reliability metrics or a held-out human-annotated subset, it is unclear whether the reported correlations reflect true environmental features or VLM-specific biases in prompt interpretation.
- [Methods (participant and image acquisition)] Methods (participant and image acquisition subsection): the manuscript does not report participant demographics, inclusion/exclusion criteria, or the protocol used to select and upload the 2674 photographs. These details are required to evaluate selection bias and to determine whether the correlations generalize beyond the sampled population and photo-taking behavior.
minor comments (2)
- [Abstract] Abstract: the phrase 'up to 33 percent' is imprecise; the exact number of features tested and the precise percentage that survived any correction should be stated.
- [Methods and Figure legends] Figure legends and Methods: clarify the exact VLM and LLM model versions, temperature settings, and prompt templates used, as these choices can materially affect rating distributions.
Simulated Author's Rebuttal
We thank the referee for their insightful and constructive comments, which have helped us identify key areas to improve the rigor and transparency of our study. We provide point-by-point responses below, outlining the revisions we intend to make to address each concern.
Point-by-point responses
Referee: [Results and Methods (statistical analysis)] Results (paragraph reporting the 33% figure) and Methods (statistical analysis subsection): the manuscript states that up to 33% of the ~1000 VLM-derived context ratings showed significant correlations with affect and stress at a nominal p<0.05 threshold. No mention is made of family-wise error rate control, false-discovery-rate adjustment, or permutation testing across the full feature set. Under the global null, ~50 false positives are expected; without correction the 33% figure is consistent with noise and does not support the claim that the pipeline yields a scalable set of replicable visual exposome markers. Greenness is a single pre-specified test and is therefore exempt, but the multi-feature claim is load-bearing for the central paradigm argument.
Authors: We agree with the referee that appropriate correction for multiple comparisons is essential to substantiate the claim regarding the broader feature set. In the revised manuscript, we will apply the Benjamini-Hochberg false discovery rate procedure to the p-values from the ~1000 correlations and report the number of features that survive correction at q < 0.05. We will also include results from permutation testing, where we randomly permute the affect and stress scores across participants 1000 times to generate a null distribution for the proportion of significant correlations. These additions will be detailed in the Methods and presented in the Results, allowing us to distinguish signal from noise more reliably while preserving the pre-specified greenness analysis. revision: yes
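The rebuttal does not include the authors' analysis code; a minimal sketch of the two proposed corrections, assuming a subjects-by-features matrix of VLM ratings and a single outcome vector (illustrative data shapes, not the authors' pipeline; nominal significance is approximated with the large-sample critical value for Pearson r rather than exact p-values), could look like this:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of discoveries under Benjamini-Hochberg FDR control at level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # BH step-up thresholds
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True                        # reject the k smallest p-values
    return mask

def null_proportion_significant(features, outcome, n_perm=1000, seed=0):
    """Null distribution for the proportion of nominally significant
    feature-outcome correlations, obtained by shuffling the outcome.
    Uses |r| > 1.96/sqrt(n) as the two-sided alpha=0.05 rule (normal approximation)."""
    rng = np.random.default_rng(seed)
    n, m = features.shape
    r_crit = 1.96 / np.sqrt(n)
    Xc = (features - features.mean(axis=0)) / features.std(axis=0)  # z-score columns
    props = np.empty(n_perm)
    for b in range(n_perm):
        y = rng.permutation(outcome)
        yc = (y - y.mean()) / y.std()
        r = Xc.T @ yc / n                         # Pearson r of each feature vs shuffled outcome
        props[b] = np.mean(np.abs(r) > r_crit)
    return props
```

The observed proportion of significant features (e.g. the reported 33 percent) would then be compared against this permutation null, alongside the count of features surviving BH at q < 0.05.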
Referee: [Methods (VLM rating and validation)] Methods (VLM rating and validation subsection): the paper provides no quantitative validation of VLM semantic ratings against human raters for the 1000 literature-derived features (only greenness is benchmarked). Without inter-rater reliability metrics or a held-out human-annotated subset, it is unclear whether the reported correlations reflect true environmental features or VLM-specific biases in prompt interpretation.
Authors: We acknowledge the importance of validating the VLM ratings for the literature-derived features. Although the greenness score was validated against established benchmarks, we did not extend this to the full set of features. In the revised manuscript, we will add a new validation analysis based on a randomly selected subset of 50 features. Two independent human raters will score these features on a held-out set of 100 images, and we will compute inter-rater agreement metrics (e.g., intraclass correlation coefficients) as well as agreement between the VLM and the average human ratings. This will provide quantitative evidence regarding the fidelity of the VLM outputs to human perception of the visual environment. revision: yes
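One standard choice for the proposed agreement metric is ICC(2,1) (two-way random effects, absolute agreement, single rater, in the Shrout-Fleiss taxonomy), computed from the subjects-by-raters matrix. The sketch below is an illustration of that formula, not the authors' planned analysis code, and assumes a complete ratings matrix with no missing values:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: (n_subjects, k_raters) array with no missing values."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)                    # per-subject means
    col_means = Y.mean(axis=0)                    # per-rater means
    ss_total = ((Y - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                       # between-subjects mean square
    msc = ss_cols / (k - 1)                       # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))            # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) penalizes systematic rater offsets, it is a stricter check on VLM-versus-human agreement than a plain correlation between the VLM ratings and the human average.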
Referee: [Methods (participant and image acquisition)] Methods (participant and image acquisition subsection): the manuscript does not report participant demographics, inclusion/exclusion criteria, or the protocol used to select and upload the 2674 photographs. These details are required to evaluate selection bias and to determine whether the correlations generalize beyond the sampled population and photo-taking behavior.
Authors: We apologize for not including these details in the main text. The full protocol for participant recruitment, inclusion and exclusion criteria, demographic characteristics of the sample (including age, sex, ethnicity, and education level), and the specific instructions and technical procedures for photograph acquisition and upload via the ecological momentary assessment app are described in the Supplementary Information. In the revised manuscript, we will add a dedicated paragraph in the Methods section summarizing these aspects, including key sample statistics, to facilitate assessment of selection bias and generalizability. revision: yes
Circularity Check
No significant circularity; derivation chain is externally anchored
full rationale
The paper's core chain proceeds as follows: participant photos are rated by VLMs for greenness and for ~1000 literature-mined environmental features; these ratings are then correlated with independent EMA measures of affect and stress. Greenness results are explicitly stated to be consistent with external benchmarks rather than derived from them. The feature list is extracted from over seven million external publications (not self-citations), and the subsequent correlations constitute new empirical tests on the photo dataset rather than tautological outputs. No equations, fitted parameters, or uniqueness claims reduce the reported associations to the inputs by construction. The statistical threshold issue raised in the skeptic note is a separate validity concern and does not constitute circularity under the defined criteria.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VLM outputs provide valid, unbiased estimates of semantic environmental features, such as greenness, that are relevant to mental health.
Reference graph
Works this paper leans on
- [1] Arias, D., Saxena, S. & Verguet, S. Quantifying the global burden of mental disorders and their economic value. EClinicalMedicine 54, 101675; 10.1016/j.eclinm.2022.101675 (2022)
- [2] Xu, J. et al. Effects of urban living environments on mental health in adults. Nat Med 29, 1456–1467; 10.1038/s41591-023-02365-w (2023)
- [3] Fong, K. C., Hart, J. E. & James, P. A review of epidemiologic studies on greenness and health: Updated literature through 2017. Curr Environ Health Rep 5, 77–87; 10.1007/s40572-018-0179-y (2018)
- [4] Tran, I., Sabol, O. & Mote, J. The relationship between greenspace exposure and psychopathology symptoms: A systematic review. Biol Psychiatry Glob Open Sci 2, 206–222; 10.1016/j.bpsgos.2022.01.004 (2022)
- [5] Li, F. et al. Global association of greenness exposure with risk of nervous system disease: A systematic review and meta-analysis. Sci Total Environ 877, 162773; 10.1016/j.scitotenv.2023.162773 (2023)
- [6] O’Leary, K. Green cities for better health. Nat Med; 10.1038/d41591-025-00001-3 (2025)
- [7] Hunter, R. F. et al. Integrating accelerometry, GPS, GIS and molecular data to investigate mechanistic pathways of the urban environmental exposome and cognitive outcomes in older adults: a longitudinal study protocol. BMJ Open 14, e085318; 10.1136/bmjopen-2024-085318 (2024)
- [8] Cohen-Cline, H., Turkheimer, E. & Duncan, G. E. Access to green space, physical activity and mental health: A twin study. J Epidemiol Community Health 69, 523–529; 10.1136/jech-2014-204667 (2015)
- [9] Liu, Y., Kwan, M.-P. & Yu, C. The uncertain geographic context problem (UGCoP) in measuring people’s exposure to green space using the integrated 3S approach. Urban Forestry & Urban Greening 85, 127972; 10.1016/j.ufug.2023.127972 (2023)
- [10] Tost, H. et al. Neural correlates of individual differences in affective benefit of real-life urban green space exposure. Nat Neurosci 22, 1389–1393; 10.1038/s41593-019-0451-y (2019)
- [11] Montone, R. A. et al. Exposome in ischaemic heart disease: beyond traditional risk factors. Eur Heart J 45, 419–438; 10.1093/eurheartj/ehae001 (2024)
- [12] Simonienko, K. et al. The impact of urban flower meadows on the well-being of city dwellers provides hints for planning biophilic green spaces. Sci Rep 15, 31981; 10.1038/s41598-025-16420-8 (2025)
- [13] Jih, J., Nguyen, A., Woo, J., Ly, A. & Shim, J. K. Using photographs to understand the context of health: A novel two-step systematic process for coding visual data. Qual Health Res 33, 1049–1058; 10.1177/10497323231198196 (2023)
- [14] Padgett, D. K., Smith, B. T., Derejko, K.-S., Henwood, B. F. & Tiderington, E. A picture is worth . . . ? Photo elicitation interviewing with formerly homeless adults. Qual Health Res 23, 1435–1444; 10.1177/1049732313507752 (2013)
- [15] Ritondo, T., Bean, C. & Lesser, I. Pictures and processes: The use of autophotography to illustrate the experience of physical activity engagement in motherhood. Methods Psychol 10, 100139; 10.1016/j.metip.2024.100139 (2024)
- [16] Zhou, X. et al. Vision language models in autonomous driving: A survey and outlook. IEEE Trans Intell Veh; 10.1109/TIV.2024.3402136 (2024)
- [17] Tian, X. et al. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289 (2024)
- [18] Shiffman, S., Stone, A. A. & Hufford, M. R. Ecological momentary assessment. Annu Rev Clin Psychol 4, 1–32; 10.1146/annurev.clinpsy.3.022806.091415 (2008)
- [19] Spano, G. et al. Objective greenness, connectedness to nature and sunlight levels towards perceived restorativeness in urban nature. Sci Rep 13, 18192; 10.1038/s41598-023-45604-3 (2023)
- [20] Li, C., Sun, C., Sun, M., Yuan, Y. & Li, P. Effects of brightness levels on stress recovery when viewing a virtual reality forest with simulated natural light. Urban For Urban Green 56, 126865; 10.1016/j.ufug.2020.126865 (2020)
- [21] Lewetz, D. & Stieger, S. ESMira: A decentralized open-source application for collecting experience sampling data. Behav Res Methods 56, 4421–4434; 10.3758/s13428-023-02194-2 (2023)
- [22] Ebner-Priemer, U. W., Reichert, M., Tost, H. & Meyer-Lindenberg, A. Wearables zum kontextgesteuerten Assessment in der Psychiatrie [Wearables for context-driven assessment in psychiatry]. Nervenarzt 90, 1207–1214; 10.1007/s00115-019-00815-w (2019)
- [23] Lu, Y. et al. Wearable data link urban green space to physical activity. Nat. Health 1, 67–77; 10.1038/s44360-025-00011-y (2025)
- [24] Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat Med 30, 2886–2896; 10.1038/s41591-024-03139-8 (2024)
- [25] Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat Med 31, 3394–3403; 10.1038/s41591-025-03888-0 (2025)
- [26] Roberts, M. C., Holt, K. E., Del Fiol, G., Baccarelli, A. A. & Allen, C. G. Precision public health in the era of genomics and big data. Nat Med 30, 1865–1873; 10.1038/s41591-024-03098-0 (2024)
- [27] Krohne, H. W., Egloff, B., Kohlmann, C. & Tausch, A. Untersuchungen mit einer deutschen Version der ‚Positive and Negative Affect Schedule‘ (PANAS) [Investigations with a German version of the Positive and Negative Affect Schedule (PANAS)]. Diagnostica 42, 139–156 (1996)
- [28] Revelle, W. & Wilt, J. Analyzing dynamic data: A tutorial. Personality and Individual Differences 136, 38–51; 10.1016/j.paid.2017.08.020 (2019)
- [29] Cohen, S., Kamarck, T. & Mermelstein, R. A global measure of perceived stress. J Health Soc Behav 24, 385–396; 10.2307/2136404 (1983)
- [30] Harris, K. M., Gaffey, A. E., Schwartz, J. E., Krantz, D. S. & Burg, M. M. The perceived stress scale as a measure of stress: Decomposing score variance in longitudinal behavioral medicine studies. Ann Behav Med 57, 846–854; 10.1093/abm/kaad015 (2023)
- [31] Schneider, E. E., Schönfelder, S., Domke-Wolf, M. & Wessa, M. Measuring stress in clinical and nonclinical subjects using a German adaptation of the Perceived Stress Scale. Int J Clin Health Psychol 20, 173–181; 10.1016/j.ijchp.2020.03.004 (2020)
- [32] R Core Team. R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, 2023)
- [33] Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Statistical Softw. 67, 1–48; 10.18637/jss.v067.i01 (2015)