LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Amanda C Collins; Andrew Campbell; Arvind Pillai; Daniel M Mackin; Michael V Heinz; Nicholas C Jacobson; Subigya Nepal; Tess Z Griffin; Wenxuan Xu

arxiv: 2512.23025 · v2 · submitted 2025-12-28 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-16 18:43 UTC · model grok-4.3


keywords: multimodal sensing · language models · mental health narratives · ecological momentary assessment · sensor-text alignment · depression and anxiety · patch-level encoder


The pith

LENS aligns raw sensor streams with language models to generate mental health narratives, training on over 100,000 EMA-derived sensor-text pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LENS as a way to turn numerical time-series sensor data into natural language descriptions of depression and anxiety symptoms. It first converts Ecological Momentary Assessment responses into text to build a large paired dataset, then trains a patch-level encoder that maps raw sensor signals straight into an LLM's space. This setup lets the model produce narratives that score better than baselines on standard metrics and symptom accuracy measures. A study with mental health professionals found the outputs comprehensive and clinically relevant, suggesting a route for LLMs to interpret behavioral signals directly.

Core claim

LENS constructs a dataset of more than 100,000 sensor-text QA pairs by transforming EMA responses on depression and anxiety symptoms into natural-language descriptions. It then trains a patch-level encoder to project raw multimodal sensor signals directly into an LLM's representation space. The resulting model generates clinically grounded narratives that outperform baselines on NLP metrics and symptom-severity accuracy, and mental health professionals rate them as comprehensive and meaningful.

What carries the argument

The patch-level encoder, which projects raw sensor signals directly into an LLM's representation space to support native integration of time-series data with language models.
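A minimal sketch of what patch-level encoding could look like, assuming the patch width of 8 quoted from the paper further down this page and the stream lengths that appear in the prompt-template fragments. The single linear projection, the zero-padding of the last patch, and the embedding size `D_MODEL` are illustrative assumptions, not the authors' architecture.

```python
import math
import random


def patchify(series, k=8):
    """Split a 1-D series of length T into N = ceil(T / k) non-overlapping patches."""
    n_patches = math.ceil(len(series) / k)
    # Zero-padding the final short patch is an assumption, not stated in the source.
    padded = list(series) + [0.0] * (n_patches * k - len(series))
    return [padded[i * k:(i + 1) * k] for i in range(n_patches)]


def project(patches, weights, bias):
    """Map each width-k patch to one d-dimensional vector.

    A single linear map stands in for the trainable encoder (hypothetical).
    """
    d = len(bias)
    return [
        [sum(p[i] * weights[i][j] for i in range(len(p))) + bias[j] for j in range(d)]
        for p in patches
    ]


# Stream lengths as listed in the prompt-template fragments on this page.
STREAM_LENGTHS = {
    "heart_rate": 1440,
    "pseudoactigraphy": 480,
    "steps": 240,
    "stress": 240,
    "gps_longitude": 24,
    "gps_latitude": 24,
    "phone_unlock": 240,
}

K = 8         # patch width quoted from the paper
D_MODEL = 16  # illustrative embedding size, far smaller than a real LLM's

rng = random.Random(0)
weights = [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(K)]
bias = [0.0] * D_MODEL

# Number of patch embeddings each stream would contribute.
patch_counts = {name: math.ceil(t / K) for name, t in STREAM_LENGTHS.items()}

# Synthetic heart-rate values, used only to exercise the functions.
heart_rate = [rng.gauss(70.0, 5.0) for _ in range(STREAM_LENGTHS["heart_rate"])]
embeddings = project(patchify(heart_rate, K), weights, bias)
```

Under these assumptions each patch becomes one vector that can sit alongside text-token embeddings, so a 1440-sample heart-rate stream contributes 180 positions rather than one token group per reading.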

If this is right

The model can process long-duration sensor streams that current LLMs cannot handle natively.
Generated narratives achieve higher scores on both general NLP metrics and specific symptom-severity accuracy measures than strong baselines.
Mental health professionals rate the outputs as comprehensive and clinically meaningful in direct user testing.
The approach supplies a scalable route for LLMs to reason over raw behavioral signals and inform clinical decisions.
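The efficiency figures quoted in the Figure 7 excerpt on this page (about 930 tokens per sample, roughly 94% fewer than text serialization, and 4× relative to vision-based models) imply rough per-sample budgets for the baselines. The back-of-envelope calculation below is not a result reported by the paper, and it reads "4×" as the vision baseline using about four times as many tokens, which is an assumption.

```python
lens_tokens = 930           # approximate tokens per sample, from the Figure 7 excerpt
reduction_vs_text_pct = 94  # "roughly 94%" reduction versus text serialization
vision_ratio = 4            # "4x" relative to vision-based models (interpretation assumed)

# If 930 tokens is what remains after a 94% reduction, the text baseline is 930 / 0.06.
implied_text_tokens = lens_tokens * 100 / (100 - reduction_vs_text_pct)

# If the vision baseline needs four times as many tokens as LENS.
implied_vision_tokens = lens_tokens * vision_ratio

print(implied_text_tokens, implied_vision_tokens)
```

Because both source numbers are stated as approximations, the implied values are order-of-magnitude estimates only.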

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same alignment technique could extend to other time-series health signals such as sleep patterns or physical activity for broader wellness applications.
Real-time deployment on wearable devices might enable continuous narrative summaries that flag symptom changes without requiring manual EMA input.
Privacy safeguards would need explicit design when scaling the sensor-to-text mapping to personal longitudinal data streams.

Load-bearing premise

Converting EMA responses into natural-language descriptions produces training pairs accurate enough to teach the encoder clinically valid links between sensor streams and symptom narratives.
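To make the premise concrete, here is a minimal sketch of template-based conversion in the spirit of the Figure 2 caption (EMA responses mapped to templates populated with frequency phrases). The 0 to 3 response scale, the phrase wording, and the template strings are all hypothetical; the paper's actual templates are not shown on this page.

```python
# Hypothetical mapping from an ordinal EMA score to a frequency phrase.
FREQUENCY_PHRASES = {
    0: "not at all",
    1: "a little",
    2: "quite a bit",
    3: "nearly all the time",
}

# Hypothetical item-level templates, loosely modelled on the EMA question wording.
TEMPLATES = {
    "depressed_mood": "In the past 4 hours, the user felt down, depressed, or hopeless {phrase}.",
    "fatigue": "In the past 4 hours, the user felt tired or low in energy {phrase}.",
}


def item_narrative(item, score):
    """Render one EMA item and its score as a template sentence."""
    if item not in TEMPLATES:
        raise ValueError(f"unknown EMA item: {item}")
    if score not in FREQUENCY_PHRASES:
        raise ValueError(f"score out of range: {score}")
    return TEMPLATES[item].format(phrase=FREQUENCY_PHRASES[score])


def summary_narrative(responses):
    """Concatenate item-level sentences into a summary-level template narrative."""
    return " ".join(item_narrative(item, score) for item, score in responses.items())


example = summary_narrative({"depressed_mood": 1, "fatigue": 3})
```

Any bias in the phrase mapping propagates directly into the training targets, which is why the fidelity of this step matters for everything downstream.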

What would settle it

A controlled comparison where LENS narratives receive lower expert ratings on clinical accuracy than human-written summaries of the same sensor data from the same participants.

Figures

Figures reproduced from arXiv: 2512.23025 by Amanda C Collins, Andrew Campbell, Arvind Pillai, Daniel M Mackin, Michael V Heinz, Nicholas C Jacobson, Subigya Nepal, Tess Z Griffin, Wenxuan Xu.

**Figure 1.** Illustration of the LENS idea. Mobile and wearable sensing signals, combined with a question, are passed to LENS, which produces a natural-language description. Clinicians can then view an interpretable snapshot of the user’s mental state instead of raw sensor streams.
… high burden on clinicians, dependence on retrospective self-reports, and reduced ecological validity because they are administered in con…

**Figure 2.** Overview of the narrative synthesis pipeline. EMA questions and responses are mapped to templates populated with corresponding frequency phrases. Subsequently, an LLM refines the text at both the item and summary levels (concatenated narratives) to enhance fluency and lexical diversity.
… high-quality QA datasets from self-reported EMA responses to serve as ground truth for model training and evaluation. Be…

**Figure 3.** LENS dataset construction pipeline. EMA responses are first converted into item-level and summary template narratives, which are then rewritten into fluent, enhanced narratives using Qwen2.5-14B. A multi-agent LLM-as-a-judge system conducts automatic quality control, routing failed cases back for regeneration until they satisfy all criteria. The final accepted narratives are then paired with paraphrased qu…

**Figure 4.** LENS Architecture. The model accepts multimodal inputs consisting of description text (e.g., "Heart rate..."), instruction text (e.g., "Summarize the user’s current mental well-being?") and raw time-series sensor streams (e.g., heart rate). The text is processed by a frozen LLM text embedder ($f_{\phi}^{\mathrm{emb}}$), while time-series data is encoded by a trainable patch-based encoder ($f_{\theta}$). The resulting embeddings are …

**Figure 5.** User Study: Comprehensiveness. Expert ratings compare how many symptoms each model’s narrative successfully covers.
From a clinical perspective, LENS achieves the highest alignment with ground-truth diagnoses. In summary-level generation, it attains a Symptom Coverage score of 0.801 and a Presence Alignment score of 0.601. This indicates that LENS is more capable of capturing the full spectrum of patient…

**Figure 7.** User Study: Clinical Utility & Language Cohesion. Comparing expert ratings indicating the usefulness of the narratives and the cohesiveness of the language.
… demonstrates superior computational efficiency, requiring approximately 930 tokens per sample, a reduction of roughly 94% compared to verbose text serialization and 4× relative to vision-based models (…

**Figure 8.** Computational Efficiency Analysis. Comparison of token consumption across three modalities: Text (Qwen-2.5), Vision-Language (TSImage/Qwen2.5-VL), and Time-Series (LENS). (a) Mean prefill tokens per prompt. (b) Total dataset token consumption (in millions) for Narrative and QA datasets.
… trained on 50% data achieved the highest coverage (0.823), slightly outperforming the full model (0.801) and the 10% m…

**Figure 9.** Qualitative examples for narrative QA (top) and single-question QA (bottom). Each example shows the full prompt context, the form of sensor input provided to the model, and the generated output. LENS directly consumes raw multivariate time-series via a patch-based time-series encoder. VLM-based baselines receive the same signals rendered as multi-panel plots, while text-based baselines process serialized n…
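The quality-control step in the Figure 3 caption (failed narratives are routed back for regeneration until every judge accepts) can be sketched as a simple loop. The `rewrite` and judge callables are toy stand-ins for LLM calls, and the `max_rounds` cap is an added assumption, since the caption does not say how non-converging cases are handled.

```python
from typing import Callable, Dict, Optional

# A judge receives (template_text, candidate) and returns True if its criterion passes.
Judge = Callable[[str, str], bool]


def refine_with_quality_control(
    template_text: str,
    rewrite: Callable[[str, int], str],
    judges: Dict[str, Judge],
    max_rounds: int = 5,
) -> Optional[str]:
    """Regenerate a narrative until all judges accept it, or give up after max_rounds."""
    for attempt in range(max_rounds):
        candidate = rewrite(template_text, attempt)
        failed = [name for name, judge in judges.items() if not judge(template_text, candidate)]
        if not failed:
            return candidate
    return None  # no candidate passed every criterion


def toy_rewrite(template_text: str, attempt: int) -> str:
    # Toy generator: the first attempt drops the symptom term, later attempts keep it.
    if attempt == 0:
        return "The user seemed fine."
    return "Over the past four hours, the user often felt tired."


toy_judges: Dict[str, Judge] = {
    "non_empty": lambda source, candidate: bool(candidate.strip()),
    "keeps_symptom": lambda source, candidate: "tired" in candidate,
}

accepted = refine_with_quality_control("The user felt tired often.", toy_rewrite, toy_judges)
```

In this toy run the first candidate fails the `keeps_symptom` check and the second is accepted; a real pipeline would likely also log which criteria failed so the regeneration prompt can be adjusted.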

Original abstract

Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LENS builds a large EMA-to-text dataset and a patch encoder for direct sensor-LLM alignment, but the clinical fidelity of the synthetic pairs is the untested hinge.


The main thing to know is that LENS turns numerical EMA responses on depression and anxiety into natural-language descriptions to create over 100k sensor-text pairs from 258 participants, then trains a patch-level encoder that projects raw multimodal time-series straight into an LLM representation space for generating narratives. This is a concrete step past feature engineering toward letting models reason over continuous behavioral signals.

The paper shows gains over baselines on NLP metrics and symptom-severity accuracy, plus a 13-person clinician study where the outputs looked comprehensive and meaningful. That combination of scale and direct alignment is the actual novelty here, and it is not just re-packaging prior multimodal sensing work. The user study adds a useful external check that the generated text passes a basic clinical sniff test.

The soft spot is exactly the one the stress-test flags: everything downstream depends on the EMA-to-text step producing high-fidelity, unbiased training pairs. The abstract gives no details on the conversion templates, prompting, or validation of those texts, so it is impossible to tell whether the learned alignments reflect real symptom phenomenology or just artifacts from the synthetic data. Quantitative results also lack effect sizes, statistical tests, or clear baseline descriptions, which leaves the performance claims hard to weigh.

This is for researchers working on multimodal health sensing and LLM interfaces for behavioral data. A reader who wants practical ideas for ingesting sensor streams without heavy preprocessing will find usable pieces. I would send it for peer review because the pipeline is new enough and the clinician feedback provides some grounding, even though the methods will need substantial expansion to support the claims.

Referee Report

2 major / 1 minor

Summary. The paper introduces LENS, a framework that constructs a large-scale paired dataset by transforming EMA responses on depression and anxiety symptoms from 258 participants into natural-language descriptions, yielding over 100,000 sensor-text QA pairs. It then trains a patch-level encoder to project raw multimodal sensor streams directly into an LLM's representation space. The work reports that LENS outperforms strong baselines on standard NLP metrics and task-specific symptom-severity accuracy measures, and a user study with 13 mental-health professionals finds the generated narratives to be comprehensive and clinically meaningful.

Significance. If the EMA-to-narrative transformation step produces clinically valid and unbiased training pairs, LENS would offer a practical route for LLMs to ingest and reason over long-duration raw sensor data without intermediate feature engineering. The patch encoder and large paired dataset address a genuine scarcity in multimodal health sensing, and the clinician ratings provide an initial signal of downstream utility for narrative-based clinical interfaces. These elements could support more scalable, interpretable mental-health sensing applications if the core alignment is shown to be grounded.

major comments (2)

[Dataset Construction] Dataset Construction section: The manuscript describes converting EMA symptom scores into natural-language descriptions to create the >100k training pairs but supplies no concrete details on the templates, prompting strategy, LLM used for generation, post-processing rules, or any validation (e.g., clinician review or inter-rater agreement) that the resulting texts faithfully capture symptom phenomenology without introducing artifacts or bias. Because the patch encoder is trained exclusively on these pairs and all reported gains in symptom-severity accuracy and narrative quality rest on this alignment, the absence of such documentation leaves the central technical claim unsupported.
[Results] Results section (and abstract): The claims of outperformance on NLP metrics and task-specific symptom-severity accuracy are stated without accompanying quantitative values, named baselines, effect sizes, confidence intervals, or statistical tests. This omission prevents evaluation of whether the reported improvements are robust or clinically relevant, directly undermining the empirical contribution.
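One way the requested confidence intervals could be produced is a paired bootstrap over per-sample metric scores. The sketch below is a generic illustration, not the authors' evaluation code; the function name, the percentile method, and any scores passed to it are hypothetical.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-sample difference (model A minus model B)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each resample's mean.
    means = sorted(statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return statistics.fmean(diffs), means[lo_idx], means[hi_idx]


# Hypothetical per-sample scores for two systems, for illustration only.
mean_diff, ci_low, ci_high = paired_bootstrap_ci([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
```

An interval that excludes zero would suggest the difference is unlikely to be resampling noise, though with repeated measures from the same 258 participants a participant-level resampling scheme would probably be more appropriate.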

minor comments (1)

[User Study] The user study is described only at a high level; adding the exact rating scales, inter-rater reliability, and any statistical comparison to baselines would strengthen the qualitative evidence without altering the core claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and quantitative results as outlined.


Referee: [Dataset Construction] Dataset Construction section: The manuscript describes converting EMA symptom scores into natural-language descriptions to create the >100k training pairs but supplies no concrete details on the templates, prompting strategy, LLM used for generation, post-processing rules, or any validation (e.g., clinician review or inter-rater agreement) that the resulting texts faithfully capture symptom phenomenology without introducing artifacts or bias. Because the patch encoder is trained exclusively on these pairs and all reported gains in symptom-severity accuracy and narrative quality rest on this alignment, the absence of such documentation leaves the central technical claim unsupported.

Authors: We agree that the Dataset Construction section requires more explicit documentation to substantiate the training pairs. In the revised manuscript, we will expand this section with the specific templates for converting EMA scores to natural language, the prompting strategy and LLM used for generation, post-processing rules applied, and validation procedures including any clinician review or inter-rater agreement metrics. These additions will directly address concerns about fidelity, bias, and support for the alignment claims. revision: yes
Referee: [Results] Results section (and abstract): The claims of outperformance on NLP metrics and task-specific symptom-severity accuracy are stated without accompanying quantitative values, named baselines, effect sizes, confidence intervals, or statistical tests. This omission prevents evaluation of whether the reported improvements are robust or clinically relevant, directly undermining the empirical contribution.

Authors: We concur that the Results section and abstract would be strengthened by including the specific quantitative details. In the revision, we will report the exact NLP metric values and symptom-severity accuracy scores, name the baselines compared against, include effect sizes, confidence intervals, and results of statistical tests. This will enable proper assessment of robustness and clinical relevance. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical dataset construction and encoder training are self-contained


The paper's core pipeline constructs >100k sensor-text pairs by transforming EMA symptom scores into natural-language descriptions, then trains a patch-level encoder to project raw multimodal streams into LLM space. All reported results (NLP metrics, symptom-severity accuracy, clinician ratings) follow from standard training and evaluation on these pairs plus a separate user study. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on the external validity of the EMA-to-text transformation rather than any internal reduction to the paper's own inputs. This is the expected non-circular outcome for an applied systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the implicit assumption that EMA responses can be reliably turned into clinically accurate natural-language descriptions; no numerical constants or new physical entities are introduced.

pith-pipeline@v0.9.0 · 5529 in / 1220 out tokens · 42769 ms · 2026-05-16T18:43:25.261776+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

echoes: the paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

IndisputableMonolith/Foundation/DimensionForcing.lean (and ArithmeticFromLogic.lean) · 8-tick period forced by reality_from_one_distinction · echoes
Paper passage: "the normalized sequence is divided into N=⌈T/k⌉ non-overlapping patches of width k, where k=8 for all streams in our experiments"

IndisputableMonolith/Cost/FunctionalEquation.lean · J(x) = ½(x + x⁻¹) − 1 uniqueness and ratio symmetry · echoes
Paper passage: "reversible value normalization … auxiliary statistics (µ, σ, min, max) are inserted into the textual prompt as metadata"
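The quoted normalization passage can be illustrated with a small sketch: values are normalized before encoding, and the statistics needed to undo the transform are kept as prompt metadata. Z-score normalization, the `eps` term, and the metadata string format are assumptions; the quote only says the normalization is reversible and that µ, σ, min, and max go into the prompt.

```python
import statistics


def normalize(values, eps=1e-8):
    """Z-score a stream and return the statistics needed to reverse the transform."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    stats = {"mean": mu, "std": sigma, "min": min(values), "max": max(values)}
    z = [(v - mu) / (sigma + eps) for v in values]
    return z, stats


def denormalize(z, stats, eps=1e-8):
    """Invert normalize() using the stored mean and standard deviation."""
    return [v * (stats["std"] + eps) + stats["mean"] for v in z]


def stats_to_prompt(name, stats):
    """Format the auxiliary statistics as text metadata (hypothetical format)."""
    return (
        f"{name}: mean={stats['mean']:.2f}, std={stats['std']:.2f}, "
        f"min={stats['min']:.2f}, max={stats['max']:.2f}"
    )


heart_rate = [60.0, 62.0, 75.0, 90.0, 71.0]  # made-up readings in bpm
z, stats = normalize(heart_rate)
recovered = denormalize(z, stats)
metadata = stats_to_prompt("Heart rate (bpm)", stats)
```

Keeping the statistics in text form lets the language model see absolute scale (for example, a resting versus elevated heart rate) even though the encoder only receives normalized shapes.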

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship
cs.HC 2026-05 unverdicted novelty 6.0

PULSE demonstrates that agentic LLM-based investigation of passive smartphone sensing data achieves balanced accuracies of 0.743 (with diary) and 0.713 (sensing-only) for predicting emotion regulation desire and inter...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 1 Pith paper

[1]

Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A

The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9):606–613. Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Rodriguez, Daniel...

work page arXiv 2025
[2]

A picture is worth a thousand numbers: Enabling LLMs reason about time series via visualization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7486–7518, Albuquerque, New Mexico. Association for Computational Lin...

work page 2025
[3]

In Proceedings of the Conference on Health, Inference, and Learning (CHIL)

Time-LLM: Time series forecasting by reprogramming large language models. Sehwan Moon, Aram Lee, Jeong Eun Kim, Hee-Ju Kang, Il-Seon Shin, Sung-Wan Kim, Jae-Min Kim, Min Jhon, and Ju-Wan Kim. 2025. Depressllm: Interpretable domain-adapted language model for depression detection from real-world narratives. Subigya Nepal, Wenjun Liu, Arvind Pillai, We...

work page arXiv 2025
[4]

Frontiers in digital health, 3:662811

Wearable, environmental, and smartphone-based passive sensing for mental health monitoring. Frontiers in digital health, 3:662811. Dimitris Spathis and Fahim Kawsar. 2024. The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models. Journal of the American Medical Informatics Association, 31(9):2151–2158....

work page 2024
[5]

Chatts: Aligning time series with llms via synthetic data for enhanced understanding and reasoning.arXiv preprint arXiv:2412.03104, 2024

ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning. arXiv preprint. ArXiv:2412.03104 [cs]. Xuhai Xu, Xin Liu, Han Zhang, Weichen Wang, Subigya Nepal, and 1 others. 2023. Globem: Cross-dataset generalization of longitudinal human behavior modeling. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 6(4...

work page arXiv 2023
[6]

Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou

Qwen3 technical report. Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. Mentallama: Interpretable mental health analysis on social media with large language models. In Proceedings of the ACM Web Conference 2024, WWW ’24, page 4489–4500, New York, NY, USA. Association for Computing Machinery. Hyungjun Yoon,...

work page arXiv 2024
[7]

scale to assess symptoms over the preceding four-hour window. As shown in Table 2, the questionnaire covers 14 distinct categories, including core depression indicators like anhedonia and depressed mood, alongside physical and cognitive markers such as somatic discomfort, fatigue, and concentration. It also includes a negative event question to unders...

work page
[8]

Anhedonia (Interest/Pleasure) In the past 4 hours, how much has the user shown little interest or pleasure in activities?

work page
[9]

Depressed Mood In the past 4 hours, how much has the user appeared down, depressed, or hopeless?

work page
[10]

Sleep Disturbance Last night, how much trouble did the user have with sleep?

work page
[11]

Fatigue / Energy In the past 4 hours, how tired or low in energy has the user been?

work page
[12]

Appetite Change In the past 4 hours, how much has the user shown a poor appetite or overeating?

work page
[13]

Self-worth / Guilt In the past 4 hours, how much has the user felt bad about themselves?

work page
[14]

Concentration In the past 4 hours, how much trouble has the user had concentrating?

work page
[15]

Psychomotor Change In the past 4 hours, how much has the user been moving or speaking more slowly than usual?

work page
[16]

Suicidal Ideation In the past 4 hours, how often has the user had thoughts of harming themselves or wishing to be dead?

work page
[17]

Somatic Discomfort In the past 4 hours, how much has the user experienced headache, abdominal discomfort, or body aches?

work page
[18]

Inverted Question An inverted question randomized from Q1, Q4, or Q7

work page
[19]

Anxiety Arousal In the past 4 hours, how much has the user felt nervous, anxious, or on edge?

work page
[20]

Uncontrollable Worry In the past 4 hours, how much has the user been unable to stop or control worrying?

work page
[21]

Table 2: EMA Questions. Categories of depression- and anxiety-related symptoms and their corresponding questions or statements

Negative Event In the past 4 hours, did the user experience a negative event? If yes: How negative was the event? Overall Summary Please summarize the user’s overall mental and physical state in the past 4 hours, integrating mood, energy, sleep, appetite, concentration, and physical symptoms. Table 2: EMA Questions. Categories of depression- and anxiety-rel...

work page arXiv
[22]

What is your professional role or background? (Options: Psychiatrist, Clinical Psychologist, Therapist, Other)

work page
[23]

What is your highest degree held? (Options: High school/Diploma, Bachelor’s, Master’s, Doctorate/MD)

work page
[24]

How familiar are you with the Patient Health Questionnaire-9 (PHQ-9) for depression screening?

work page
[25]

scores": [...],

How familiar are you with the Generalized Anxiety Disorder Questionnaire (GAD-7) for anxiety screening? For Questions 3 and 4, the response options were: Not familiar at all, Slightly familiar, Moderately familiar, and Very familiar. The experts’ pre-survey responses are summarized in Table 7. Survey Flow. After completing the pre-survey questions, each ex...

work page arXiv
[26]

Anhedonia (loss of interest or pleasure)

work page
[27]

If a symptom is not present in a text, you must set both presence and severity to 0

OverallSeverity Severity scale is ordinal and must be inferred from the overall semantic strength of the description. If a symptom is not present in a text, you must set both presence and severity to 0. User Prompt Template Reference Summary: {reference} Prediction Summary: {prediction} Structured Response Schema SymptomEvaluation object with 14 symptom f...

work page
[33]

Phone unlock status: number of unlock events per minute, representing cognitive or social engagement. Additional contextual features: {contextual_features} Your task: Analyze the figure carefully and provide a clinical summary of the user’s recent psychological and behavioral state, as if you were writing a short report based on PHQ-9–related observati...

work page
[34]

Heart rate (bpm): indicator of arousal, stress, and autonomic balance

work page
[35]

Pseudoactigraphy: derived from wrist accelerometer signals, representing movement intensity and rest–activity rhythm

work page
[36]

Steps per minute: reflects overall mobility and engagement in physical activity

work page
[37]

Stress level: Garmin HRV-based estimation of physiological stress

work page
[38]

GPS coordinates (longitude and latitude): capture spatial mobility and time spent in different environments

work page
[39]

Additional contextual features: {contextual_features} Question: {question} Your task: Analyze the figure and answer the question directly

Phone unlock status: number of unlock events per minute, representing cognitive or social engagement. Additional contextual features: {contextual_features} Question: {question} Your task: Analyze the figure and answer the question directly. Base your reasoning only on observable behavioral and physiological patterns plus the contextual features. Produce...

work page
[46]

Phone unlock status (binary 0/1 per minute, length 240)<ts></ts> {sleep_conversation} Task: Using only the provided textual data, produce a short clinical summary (about one concise paragraph) describing the user’s psychological and physical state over the last 4 hours. Your description should resemble a human-written mental-health assessment and cover...

work page
[47]

Heart rate (1 reading per minute, length 1440) <ts></ts>

work page
[48]

Pseudoactigraphy (accelerometer-based movement intensity × zero-crossing rate, length 480) <ts></ts>

work page
[49]

Steps per minute (length 240)<ts></ts>

work page
[50]

Stress level (length 240)<ts></ts>

work page
[51]

GPS longitude (length 24)<ts></ts>

work page
[52]

GPS latitude (length 24)<ts></ts>

work page
[53]

• Refer only to the information implied by the time-series data; do not add external facts

Phone unlock status (binary 0/1 per minute, length 240)<ts></ts> {sleep_conversation} Question: {question} Answer Requirements: • Provide a concise, clinically grounded answer in one or two sentences. • Refer only to the information implied by the time-series data; do not add external facts. • If the data is insufficient, explicitly say so. Answer: Table...

work page

[1] [1]

Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A

The phq-9: validity of a brief depression sever- ity measure.Journal of general internal medicine, 16(9):606–613. Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Ro- driguez, Daniel...

work page arXiv 2025

[2] [2]

A picture is worth a thousand numbers: En- abling LLMs reason about time series via visualiza- tion. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 7486– 7518, Albuquerque, New Mexico. Association for Computational Lin...

work page 2025

[3] [3]

In Proceedings of the Conference on Health, Inference, and Learning (CHIL)

Time-LLM: Time series forecasting by repro- gramming large language models. Sehwan Moon, Aram Lee, Jeong Eun Kim, Hee-Ju Kang, Il-Seon Shin, Sung-Wan Kim, Jae-Min Kim, Min Jhon, and Ju-Wan Kim. 2025. Depressllm: In- terpretable domain-adapted language model for de- pression detection from real-world narratives. Subigya Nepal, Wenjun Liu, Arvind Pillai, We...

work page arXiv 2025

[4] [4]

Frontiers in digital health, 3:662811

Wearable, environmental, and smartphone- based passive sensing for mental health monitoring. Frontiers in digital health, 3:662811. Dimitris Spathis and Fahim Kawsar. 2024. The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models.Journal of the American Medical Informatics Association, 31(9):2151–2158....

work page 2024

[5] [5]

Chatts: Aligning time series with llms via synthetic data for enhanced understanding and reasoning.arXiv preprint arXiv:2412.03104, 2024

ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning.arXiv preprint. ArXiv:2412.03104 [cs]. Xuhai Xu, Xin Liu, Han Zhang, Weichen Wang, Subi- gya Nepal, and 1 others. 2023. Globem: Cross- dataset generalization of longitudinal human behav- ior modeling.Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 6(4...

work page arXiv 2023

[6] [6]

Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou

Qwen3 technical report. Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. Mental- lama: Interpretable mental health analysis on social media with large language models. InProceedings of the ACM Web Conference 2024, WWW ’24, page 4489–4500, New York, NY , USA. Association for Computing Machinery. Hyungjun Yoon,...

work page arXiv 2024

[7] [7]

scale to assess symptoms over the preced- ing four-hour window. As shown in Table 2, the questionnaire covers 14 distinct categories, includ- ing core depression indicators like anhedonia and depressed mood, alongside physical and cognitive markers such as somatic discomfort, fatigue, and concentration. It also includes a negative event question to unders...


[8] [8]

Anhedonia (Interest/Pleasure) In the past 4 hours, how much has the user shown little interest or pleasure in activities?


[9] [9]

Depressed Mood In the past 4 hours, how much has the user appeared down, depressed, or hopeless?


[10] [10]

Sleep Disturbance Last night, how much trouble did the user have with sleep?


[11] [11]

Fatigue / Energy In the past 4 hours, how tired or low in energy has the user been?


[12] [12]

Appetite Change In the past 4 hours, how much has the user shown a poor appetite or overeating?


[13] [13]

Self-worth / Guilt In the past 4 hours, how much has the user felt bad about themselves?


[14] [14]

Concentration In the past 4 hours, how much trouble has the user had concentrating?


[15] [15]

Psychomotor Change In the past 4 hours, how much has the user been moving or speaking more slowly than usual?


[16] [16]

Suicidal Ideation In the past 4 hours, how often has the user had thoughts of harming themselves or wishing to be dead?


[17] [17]

Somatic Discomfort In the past 4 hours, how much has the user experienced headache, abdominal discomfort, or body aches?


[18] [18]

Inverted Question An inverted question randomized from Q1, Q4, or Q7


[19] [19]

Anxiety Arousal In the past 4 hours, how much has the user felt nervous, anxious, or on edge?


[20] [20]

Uncontrollable Worry In the past 4 hours, how much has the user been unable to stop or control worrying?


[21] [21]

Table 2: EMA Questions. Categories of depression- and anxiety-related symptoms and their corresponding questions or statements.

Negative Event In the past 4 hours, did the user experience a negative event? If yes: How negative was the event?
Overall Summary Please summarize the user’s overall mental and physical state in the past 4 hours, integrating mood, energy, sleep, appetite, concentration, and physical symptoms.
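The inverted-question row of Table 2 can be illustrated with a minimal sketch. It assumes that "inverted" means a reverse-worded copy of Q1, Q4, or Q7 whose answer is reverse-scored on the same response scale; the text here does not confirm that, and the scale bounds are not stated, so they are passed in as parameters. The names `pick_inverted_source` and `reverse_score` are hypothetical.

```python
import random

# Q1, Q4, and Q7 are the candidates named in Table 2.
INVERTIBLE_QUESTIONS = (1, 4, 7)


def pick_inverted_source(rng: random.Random) -> int:
    """Randomly choose which question is presented in inverted form."""
    return rng.choice(INVERTIBLE_QUESTIONS)


def reverse_score(value: int, low: int, high: int) -> int:
    """Map a response onto the mirrored position of the same scale.

    Assumption: standard reverse scoring (low + high - value). The real
    scale bounds are not given in this section.
    """
    if not low <= value <= high:
        raise ValueError(f"response {value} outside scale [{low}, {high}]")
    return low + high - value
```

With a hypothetical 0 to 3 scale, `reverse_score(1, 0, 3)` gives 2, so an inverted item can be compared against its original to check response consistency.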


[22] [22]

What is your professional role or background? (Options: Psychiatrist, Clinical Psychologist, Therapist, Other)


[23] [23]

What is your highest degree held? (Options: High school/Diploma, Bachelor’s, Master’s, Doctorate/MD)


[24] [24]

How familiar are you with the Patient Health Questionnaire-9 (PHQ-9) for depression screening?


[25] [25]


How familiar are you with the Generalized Anxiety Disorder Questionnaire (GAD-7) for anxiety screening?

For Questions 3 and 4, the response options were: Not familiar at all, Slightly familiar, Moderately familiar, and Very familiar. The experts’ pre-survey responses are summarized in Table 7.

Survey Flow. After completing the pre-survey questions, each ex...


[26] [26]

Anhedonia (loss of interest or pleasure)


[27] [27]


OverallSeverity Severity scale is ordinal and must be inferred from the overall semantic strength of the description. If a symptom is not present in a text, you must set both presence and severity to 0.

User Prompt Template
Reference Summary: {reference}
Prediction Summary: {prediction}

Structured Response Schema
SymptomEvaluation object with 14 symptom f...
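The rule that an absent symptom carries both presence 0 and severity 0 can be enforced after parsing the judge's structured response. The sketch below is illustrative only: the schema is truncated in this section, so the `SymptomScore` fields, the symptom keys, and the helper names are assumptions, not the paper's actual `SymptomEvaluation` definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymptomScore:
    presence: int  # 1 if the text describes the symptom, else 0
    severity: int  # ordinal severity; the exact range is an assumption


def enforce_absent_rule(score: SymptomScore) -> SymptomScore:
    """Apply the prompt's rule: an absent symptom has severity 0."""
    if score.presence == 0:
        return SymptomScore(presence=0, severity=0)
    return score


def normalize_evaluation(evaluation: dict[str, SymptomScore]) -> dict[str, SymptomScore]:
    """Apply the rule to every symptom in one evaluation (keys are hypothetical)."""
    return {name: enforce_absent_rule(score) for name, score in evaluation.items()}
```

Running the same normalization on the evaluations of both the reference summary and the predicted summary keeps the two comparable when the judge violates the rule.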


[28] [33]

Phone unlock status: number of unlock events per minute, representing cognitive or social engagement.
Additional contextual features: {contextual_features}
Your task: Analyze the figure carefully and provide a clinical summary of the user’s recent psychological and behavioral state, as if you were writing a short report based on PHQ-9–related observati...


[29] [34]

Heart rate (bpm): indicator of arousal, stress, and autonomic balance


[30] [35]

Pseudoactigraphy: derived from wrist accelerometer signals, representing movement intensity and rest–activity rhythm


[31] [36]

Steps per minute: reflects overall mobility and engagement in physical activity


[32] [37]

Stress level: Garmin HRV-based estimation of physiological stress


[33] [38]

GPS coordinates (longitude and latitude): capture spatial mobility and time spent in different environments


[34] [39]


Phone unlock status: number of unlock events per minute, representing cognitive or social engagement.
Additional contextual features: {contextual_features}
Question: {question}
Your task: Analyze the figure and answer the question directly. Base your reasoning only on observable behavioral and physiological patterns plus the contextual features. Produce...


[35] [46]

Phone unlock status (binary 0/1 per minute, length 240)<ts></ts>
{sleep_conversation}
Task: Using only the provided textual data, produce a short clinical summary (about one concise paragraph) describing the user’s psychological and physical state over the last 4 hours. Your description should resemble a human-written mental-health assessment and cover...


[36] [47]

Heart rate (1 reading per minute, length 1440) <ts></ts>


[37] [48]

Pseudoactigraphy (accelerometer-based movement intensity × zero-crossing rate, length 480) <ts></ts>


[38] [49]

Steps per minute (length 240)<ts></ts>


[39] [50]

Stress level (length 240)<ts></ts>


[40] [51]

GPS longitude (length 24)<ts></ts>


[41] [52]

GPS latitude (length 24)<ts></ts>


[42] [53]


Phone unlock status (binary 0/1 per minute, length 240)<ts></ts>
{sleep_conversation}
Question: {question}
Answer Requirements:
• Provide a concise, clinically grounded answer in one or two sentences.
• Refer only to the information implied by the time-series data; do not add external facts.
• If the data is insufficient, explicitly say so.
Answer:
Table...
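The QA prompt layout listed above can be assembled programmatically. The following is a minimal sketch: the stream descriptions, lengths, and requirement bullets are copied from the listing, while the stream ordering, the helper name `build_qa_prompt`, and the handling of an empty `sleep_conversation` are assumptions. How the `<ts></ts>` slots are later filled (raw values or encoder embeddings) is not shown in this section and is left out.

```python
TS_SLOT = "<ts></ts>"

# Descriptions copied from the prompt listing; the ordering is an assumption.
STREAMS = [
    "Heart rate (1 reading per minute, length 1440)",
    "Pseudoactigraphy (accelerometer-based movement intensity × zero-crossing rate, length 480)",
    "Steps per minute (length 240)",
    "Stress level (length 240)",
    "GPS longitude (length 24)",
    "GPS latitude (length 24)",
    "Phone unlock status (binary 0/1 per minute, length 240)",
]

REQUIREMENTS = [
    "Provide a concise, clinically grounded answer in one or two sentences.",
    "Refer only to the information implied by the time-series data; do not add external facts.",
    "If the data is insufficient, explicitly say so.",
]


def build_qa_prompt(question: str, sleep_conversation: str = "") -> str:
    """Assemble a QA prompt with one time-series slot per sensor stream."""
    lines = [f"{description} {TS_SLOT}" for description in STREAMS]
    if sleep_conversation:
        lines.append(sleep_conversation)
    lines.append(f"Question: {question}")
    lines.append("Answer Requirements:")
    lines.extend(f"• {requirement}" for requirement in REQUIREMENTS)
    lines.append("Answer:")
    return "\n".join(lines)
```

Keeping the stream list in one place makes it straightforward to check that the number of slots in the prompt matches the number of sensor streams supplied to the model.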
