Quantifying the human visual exposome with vision language models
Pith reviewed 2026-05-07 04:07 UTC · model grok-4.3
The pith
Vision language models can quantify the visual exposome from everyday photos and link it to mental health outcomes like stress and mood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 2674 participant-generated photographs, VLM-derived greenness estimates robustly predicted momentary affect and chronic stress, matching established benchmarks. A semi-autonomous LLM pipeline mined over seven million publications to derive nearly 1000 environmental features linked to mental health; when these features were rated from real-world imagery, up to 33 percent of the VLM context ratings correlated significantly with affect and stress measures. This work sets out a scalable, objective approach to visual exposomics that decodes associations between the visible world and mental health.
What carries the argument
Vision language models applied to participant photographs to generate semantic ratings of visual context, augmented by a semi-autonomous LLM pipeline that extracts empirically linked environmental features from scientific literature.
If this is right
- Greenness visible in daily scenes can be treated as a predictor of immediate emotional state and chronic stress.
- Nearly 1000 literature-derived environmental features become assessable from ordinary images for their mental health relevance.
- High-throughput objective analysis of personal visual environments replaces reliance on geospatial proxies or self-reports.
- A scalable paradigm emerges for linking specific visible features to mental health across large numbers of people and settings.
Where Pith is reading between the lines
- Phone-based photo analysis could support personal tools that flag environments associated with better or worse mood.
- The approach could extend to other outcomes such as physical activity levels or sleep quality by applying the same feature ratings.
- Urban planning might use quantified visual metrics from many users to evaluate and improve public spaces for psychological effects.
Load-bearing premise
Participant photographs accurately capture first-person daily visual context and VLM-derived semantic ratings validly reflect environmental features that are relevant to mental health outcomes.
What would settle it
A replication study in which VLM greenness ratings from new participant photos show no correlation with independent greenness measures or with affect and stress scores would falsify the central claim.
Original abstract
The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous large language model (LLM) based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33 percent of VLM-extracted context ratings significantly correlated with affect and stress. These findings establish a scalable objective paradigm for visual exposomics, enabling high-throughput decoding of how the visible world is associated with mental health.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a pipeline coupling ecological momentary assessment with vision-language models (VLMs) to quantify semantic features of the visual exposome from 2674 participant-generated photographs. VLM-derived greenness scores are shown to robustly predict momentary affect and chronic stress in line with prior benchmarks. An LLM-based literature-mining pipeline extracts nearly 1000 environmental features previously linked to mental health; when these features are rated by VLMs on the same images, up to 33% of the ratings correlate significantly with affect and stress measures. The work positions this approach as a scalable, objective paradigm for visual exposomics.
Significance. If the reported associations survive appropriate statistical controls, the study offers a genuinely new, high-throughput route to operationalize the visual component of the exposome. The greenness result is anchored to external benchmarks and therefore provides a credible proof-of-concept; the broader feature set, if validated, would enable systematic discovery of visual correlates of mental health without reliance on coarse geospatial or self-report data. The combination of VLM image understanding with LLM-driven literature synthesis is technically timely and could be adopted by exposome and environmental-psychology researchers.
major comments (3)
- [Results and Methods (statistical analysis)] Results (paragraph reporting the 33% figure) and Methods (statistical analysis subsection): the manuscript states that up to 33% of the ~1000 VLM-derived context ratings showed significant correlations with affect and stress at a nominal p<0.05 threshold. No mention is made of family-wise error rate control, false-discovery-rate adjustment, or permutation testing across the full feature set. Under the global null, ~50 false positives are expected; without correction the 33% figure is consistent with noise and does not support the claim that the pipeline yields a scalable set of replicable visual exposome markers. Greenness is a single pre-specified test and is therefore exempt, but the multi-feature claim is load-bearing for the central paradigm argument.
- [Methods (VLM rating and validation)] Methods (VLM rating and validation subsection): the paper provides no quantitative validation of VLM semantic ratings against human raters for the 1000 literature-derived features (only greenness is benchmarked). Without inter-rater reliability metrics or a held-out human-annotated subset, it is unclear whether the reported correlations reflect true environmental features or VLM-specific biases in prompt interpretation.
- [Methods (participant and image acquisition)] Methods (participant and image acquisition subsection): the manuscript does not report participant demographics, inclusion/exclusion criteria, or the protocol used to select and upload the 2674 photographs. These details are required to evaluate selection bias and to determine whether the correlations generalize beyond the sampled population and photo-taking behavior.
minor comments (2)
- [Abstract] Abstract: the phrase 'up to 33 percent' is imprecise; the exact number of features tested and the precise percentage that survived any correction should be stated.
- [Methods and Figure legends] Figure legends and Methods: clarify the exact VLM and LLM model versions, temperature settings, and prompt templates used, as these choices can materially affect rating distributions.
Simulated Author's Rebuttal
We thank the referee for their insightful and constructive comments, which have helped us identify key areas to improve the rigor and transparency of our study. We provide point-by-point responses below, outlining the revisions we intend to make to address each concern.
Point-by-point responses
Referee: [Results and Methods (statistical analysis)] Results (paragraph reporting the 33% figure) and Methods (statistical analysis subsection): the manuscript states that up to 33% of the ~1000 VLM-derived context ratings showed significant correlations with affect and stress at a nominal p<0.05 threshold. No mention is made of family-wise error rate control, false-discovery-rate adjustment, or permutation testing across the full feature set. Under the global null, ~50 false positives are expected; without correction the 33% figure is consistent with noise and does not support the claim that the pipeline yields a scalable set of replicable visual exposome markers. Greenness is a single pre-specified test and is therefore exempt, but the multi-feature claim is load-bearing for the central paradigm argument.
Authors: We agree with the referee that appropriate correction for multiple comparisons is essential to substantiate the claim regarding the broader feature set. In the revised manuscript, we will apply the Benjamini-Hochberg false discovery rate procedure to the p-values from the ~1000 correlations and report the number of features that survive correction at q < 0.05. We will also include results from permutation testing, where we randomly permute the affect and stress scores across participants 1000 times to generate a null distribution for the proportion of significant correlations. These additions will be detailed in the Methods and presented in the Results, allowing us to distinguish signal from noise more reliably while preserving the pre-specified greenness analysis. revision: yes
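The rebuttal does not include the authors' analysis code; a minimal sketch of the two proposed corrections, assuming a subjects-by-features matrix of VLM ratings and a single outcome vector (illustrative data shapes, not the authors' pipeline; nominal significance is approximated with the large-sample critical value for Pearson r rather than exact p-values), could look like this:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of discoveries under Benjamini-Hochberg FDR control at level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # BH step-up thresholds
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True                        # reject the k smallest p-values
    return mask

def null_proportion_significant(features, outcome, n_perm=1000, seed=0):
    """Null distribution for the proportion of nominally significant
    feature-outcome correlations, obtained by shuffling the outcome.
    Uses |r| > 1.96/sqrt(n) as the two-sided alpha=0.05 rule (normal approximation)."""
    rng = np.random.default_rng(seed)
    n, m = features.shape
    r_crit = 1.96 / np.sqrt(n)
    Xc = (features - features.mean(axis=0)) / features.std(axis=0)  # z-score columns
    props = np.empty(n_perm)
    for b in range(n_perm):
        y = rng.permutation(outcome)
        yc = (y - y.mean()) / y.std()
        r = Xc.T @ yc / n                         # Pearson r of each feature vs shuffled outcome
        props[b] = np.mean(np.abs(r) > r_crit)
    return props
```

The observed proportion of significant features (e.g. the reported 33 percent) would then be compared against this permutation null, alongside the count of features surviving BH at q < 0.05.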
Referee: [Methods (VLM rating and validation)] Methods (VLM rating and validation subsection): the paper provides no quantitative validation of VLM semantic ratings against human raters for the 1000 literature-derived features (only greenness is benchmarked). Without inter-rater reliability metrics or a held-out human-annotated subset, it is unclear whether the reported correlations reflect true environmental features or VLM-specific biases in prompt interpretation.
Authors: We acknowledge the importance of validating the VLM ratings for the literature-derived features. Although the greenness score was validated against established benchmarks, we did not extend this to the full set of features. In the revised manuscript, we will add a new validation analysis based on a randomly selected subset of 50 features. Two independent human raters will score these features on a held-out set of 100 images, and we will compute inter-rater agreement metrics (e.g., intraclass correlation coefficients) as well as agreement between the VLM and the average human ratings. This will provide quantitative evidence regarding the fidelity of the VLM outputs to human perception of the visual environment. revision: yes
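One standard choice for the proposed agreement metric is ICC(2,1) (two-way random effects, absolute agreement, single rater, in the Shrout-Fleiss taxonomy), computed from the subjects-by-raters matrix. The sketch below is an illustration of that formula, not the authors' planned analysis code, and assumes a complete ratings matrix with no missing values:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: (n_subjects, k_raters) array with no missing values."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)                    # per-subject means
    col_means = Y.mean(axis=0)                    # per-rater means
    ss_total = ((Y - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                       # between-subjects mean square
    msc = ss_cols / (k - 1)                       # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))            # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) penalizes systematic rater offsets, it is a stricter check on VLM-versus-human agreement than a plain correlation between the VLM ratings and the human average.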
Referee: [Methods (participant and image acquisition)] Methods (participant and image acquisition subsection): the manuscript does not report participant demographics, inclusion/exclusion criteria, or the protocol used to select and upload the 2674 photographs. These details are required to evaluate selection bias and to determine whether the correlations generalize beyond the sampled population and photo-taking behavior.
Authors: We apologize for not including these details in the main text. The full protocol for participant recruitment, inclusion and exclusion criteria, demographic characteristics of the sample (including age, sex, ethnicity, and education level), and the specific instructions and technical procedures for photograph acquisition and upload via the ecological momentary assessment app are described in the Supplementary Information. In the revised manuscript, we will add a dedicated paragraph in the Methods section summarizing these aspects, including key sample statistics, to facilitate assessment of selection bias and generalizability. revision: yes
Circularity Check
No significant circularity; derivation chain is externally anchored
full rationale
The paper's core chain proceeds as follows: participant photos are rated by VLMs for greenness and for ~1000 literature-mined environmental features; these ratings are then correlated with independent EMA measures of affect and stress. Greenness results are explicitly stated to be consistent with external benchmarks rather than derived from them. The feature list is extracted from over seven million external publications (not self-citations), and the subsequent correlations constitute new empirical tests on the photo dataset rather than tautological outputs. No equations, fitted parameters, or uniqueness claims reduce the reported associations to the inputs by construction. The statistical threshold issue raised in the skeptic note is a separate validity concern and does not constitute circularity under the defined criteria.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VLM outputs provide valid, unbiased estimates of semantic environmental features, such as greenness, that are relevant to mental health.
Reference graph
Works this paper leans on
- [1] Arias, D., Saxena, S. & Verguet, S. Quantifying the global burden of mental disorders and their economic value. EClinicalMedicine 54, 101675; 10.1016/j.eclinm.2022.101675 (2022)
- [2] Xu, J. et al. Effects of urban living environments on mental health in adults. Nat Med 29, 1456–1467; 10.1038/s41591-023-02365-w (2023)
- [3] Fong, K. C., Hart, J. E. & James, P. A review of epidemiologic studies on greenness and health: Updated literature through 2017. Curr Environ Health Rep 5, 77–87; 10.1007/s40572-018-0179-y (2018)
- [4] Tran, I., Sabol, O. & Mote, J. The relationship between greenspace exposure and psychopathology symptoms: A systematic review. Biol Psychiatry Glob Open Sci 2, 206–222; 10.1016/j.bpsgos.2022.01.004 (2022)
- [5] Li, F. et al. Global association of greenness exposure with risk of nervous system disease: A systematic review and meta-analysis. Sci Total Environ 877, 162773; 10.1016/j.scitotenv.2023.162773 (2023)
- [6] O’Leary, K. Green cities for better health. Nat Med; 10.1038/d41591-025-00001-3 (2025)
- [7] Hunter, R. F. et al. Integrating accelerometry, GPS, GIS and molecular data to investigate mechanistic pathways of the urban environmental exposome and cognitive outcomes in older adults: a longitudinal study protocol. BMJ Open 14, e085318; 10.1136/bmjopen-2024-085318 (2024)
- [8] Cohen-Cline, H., Turkheimer, E. & Duncan, G. E. Access to green space, physical activity and mental health: A twin study. J Epidemiol Community Health 69, 523–529; 10.1136/jech-2014-204667 (2015)
- [9] Liu, Y., Kwan, M.-P. & Yu, C. The uncertain geographic context problem (UGCoP) in measuring people’s exposure to green space using the integrated 3S approach. Urban Forestry & Urban Greening 85, 127972; 10.1016/j.ufug.2023.127972 (2023)
- [10] Tost, H. et al. Neural correlates of individual differences in affective benefit of real-life urban green space exposure. Nat Neurosci 22, 1389–1393; 10.1038/s41593-019-0451-y (2019)
- [11] Montone, R. A. et al. Exposome in ischaemic heart disease: beyond traditional risk factors. Eur Heart J 45, 419–438; 10.1093/eurheartj/ehae001 (2024)
- [12] Simonienko, K. et al. The impact of urban flower meadows on the well-being of city dwellers provides hints for planning biophilic green spaces. Sci Rep 15, 31981; 10.1038/s41598-025-16420-8 (2025)
- [13] Jih, J., Nguyen, A., Woo, J., Ly, A. & Shim, J. K. Using photographs to understand the context of health: A novel two-step systematic process for coding visual data. Qual Health Res 33, 1049–1058; 10.1177/10497323231198196 (2023)
- [14] Padgett, D. K., Smith, B. T., Derejko, K.-S., Henwood, B. F. & Tiderington, E. A picture is worth . . . ? Photo elicitation interviewing with formerly homeless adults. Qual Health Res 23, 1435–1444; 10.1177/1049732313507752 (2013)
- [15] Ritondo, T., Bean, C. & Lesser, I. Pictures and processes: The use of autophotography to illustrate the experience of physical activity engagement in motherhood. Methods Psychol 10, 100139; 10.1016/j.metip.2024.100139 (2024)
- [16] Zhou, X. et al. Vision language models in autonomous driving: A survey and outlook. IEEE Trans Intell Veh; 10.1109/TIV.2024.3402136 (2024)
- [17] Tian, X. et al. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289 (2024)
- [18] Shiffman, S., Stone, A. A. & Hufford, M. R. Ecological momentary assessment. Annu Rev Clin Psychol 4, 1–32; 10.1146/annurev.clinpsy.3.022806.091415 (2008)
- [19] Spano, G. et al. Objective greenness, connectedness to nature and sunlight levels towards perceived restorativeness in urban nature. Sci Rep 13, 18192; 10.1038/s41598-023-45604-3 (2023)
- [20] Li, C., Sun, C., Sun, M., Yuan, Y. & Li, P. Effects of brightness levels on stress recovery when viewing a virtual reality forest with simulated natural light. Urban For Urban Green 56, 126865; 10.1016/j.ufug.2020.126865 (2020)
- [21] Lewetz, D. & Stieger, S. ESMira: A decentralized open-source application for collecting experience sampling data. Behav Res Methods 56, 4421–4434; 10.3758/s13428-023-02194-2 (2023)
- [22] Ebner-Priemer, U. W., Reichert, M., Tost, H. & Meyer-Lindenberg, A. Wearables zum kontextgesteuerten Assessment in der Psychiatrie [Wearables for context-driven assessment in psychiatry]. Nervenarzt 90, 1207–1214; 10.1007/s00115-019-00815-w (2019)
- [23] Lu, Y. et al. Wearable data link urban green space to physical activity. Nat. Health 1, 67–77; 10.1038/s44360-025-00011-y (2025)
- [24] Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat Med 30, 2886–2896; 10.1038/s41591-024-03139-8 (2024)
- [25] Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat Med 31, 3394–3403; 10.1038/s41591-025-03888-0 (2025)
- [26] Roberts, M. C., Holt, K. E., Del Fiol, G., Baccarelli, A. A. & Allen, C. G. Precision public health in the era of genomics and big data. Nat Med 30, 1865–1873; 10.1038/s41591-024-03098-0 (2024)
- [27] Krohne, H. W., Egloff, B., Kohlmann, C. & Tausch, A. Untersuchungen mit einer deutschen Version der ‚Positive and Negative Affect Schedule‘ (PANAS) [Investigations with a German version of the Positive and Negative Affect Schedule (PANAS)]. Diagnostica 42, 139–156 (1996)
- [28] Revelle, W. & Wilt, J. Analyzing dynamic data: A tutorial. Personality and Individual Differences 136, 38–51; 10.1016/j.paid.2017.08.020 (2019)
- [29] Cohen, S., Kamarck, T. & Mermelstein, R. A global measure of perceived stress. J Health Soc Behav 24, 385–396; 10.2307/2136404 (1983)
- [30] Harris, K. M., Gaffey, A. E., Schwartz, J. E., Krantz, D. S. & Burg, M. M. The perceived stress scale as a measure of stress: Decomposing score variance in longitudinal behavioral medicine studies. Ann Behav Med 57, 846–854; 10.1093/abm/kaad015 (2023)
- [31] Schneider, E. E., Schönfelder, S., Domke-Wolf, M. & Wessa, M. Measuring stress in clinical and nonclinical subjects using a German adaptation of the Perceived Stress Scale. Int J Clin Health Psychol 20, 173–181; 10.1016/j.ijchp.2020.03.004 (2020)
- [32] R Core Team. R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, 2023)
- [33] Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Statistical Softw. 67, 1–48; 10.18637/jss.v067.i01 (2015)