LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models
Pith reviewed 2026-05-16 18:43 UTC · model grok-4.3
The pith
LENS aligns raw sensor streams with language models to generate mental health narratives, training on over 100,000 EMA-derived sensor-text pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LENS constructs a dataset of more than 100,000 sensor-text QA pairs by transforming EMA responses on depression and anxiety symptoms into natural-language descriptions. It then trains a patch-level encoder to project raw multimodal sensor signals directly into an LLM's representation space. The resulting model generates clinically grounded narratives that outperform baselines on NLP metrics and symptom-severity accuracy, and mental health professionals rate them as comprehensive and meaningful.
What carries the argument
The patch-level encoder, which projects raw sensor signals directly into an LLM's representation space to support native integration of time-series data with language models.
If this is right
- The model can process long-duration sensor streams that current LLMs cannot handle natively.
- Generated narratives achieve higher scores on both general NLP metrics and specific symptom-severity accuracy measures than strong baselines.
- Mental health professionals rate the outputs as comprehensive and clinically meaningful in direct user testing.
- The approach supplies a scalable route for LLMs to reason over raw behavioral signals and inform clinical decisions.
Where Pith is reading between the lines
- The same alignment technique could extend to other time-series health signals such as sleep patterns or physical activity for broader wellness applications.
- Real-time deployment on wearable devices might enable continuous narrative summaries that flag symptom changes without requiring manual EMA input.
- Privacy safeguards would need explicit design when scaling the sensor-to-text mapping to personal longitudinal data streams.
Load-bearing premise
Converting EMA responses into natural-language descriptions produces training pairs accurate enough to teach the encoder clinically valid links between sensor streams and symptom narratives.
What would settle it
A controlled comparison where LENS narratives receive lower expert ratings on clinical accuracy than human-written summaries of the same sensor data from the same participants.
Original abstract
Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LENS, a framework that constructs a large-scale paired dataset by transforming EMA responses on depression and anxiety symptoms from 258 participants into natural-language descriptions, yielding over 100,000 sensor-text QA pairs. It then trains a patch-level encoder to project raw multimodal sensor streams directly into an LLM's representation space. The work reports that LENS outperforms strong baselines on standard NLP metrics and task-specific symptom-severity accuracy measures, and a user study with 13 mental-health professionals finds the generated narratives to be comprehensive and clinically meaningful.
Significance. If the EMA-to-narrative transformation step produces clinically valid and unbiased training pairs, LENS would offer a practical route for LLMs to ingest and reason over long-duration raw sensor data without intermediate feature engineering. The patch encoder and large paired dataset address a genuine scarcity in multimodal health sensing, and the clinician ratings provide an initial signal of downstream utility for narrative-based clinical interfaces. These elements could support more scalable, interpretable mental-health sensing applications if the core alignment is shown to be grounded.
major comments (2)
- [Dataset Construction] Dataset Construction section: The manuscript describes converting EMA symptom scores into natural-language descriptions to create the >100k training pairs but supplies no concrete details on the templates, prompting strategy, LLM used for generation, post-processing rules, or any validation (e.g., clinician review or inter-rater agreement) that the resulting texts faithfully capture symptom phenomenology without introducing artifacts or bias. Because the patch encoder is trained exclusively on these pairs and all reported gains in symptom-severity accuracy and narrative quality rest on this alignment, the absence of such documentation leaves the central technical claim unsupported.
- [Results] Results section (and abstract): The claims of outperformance on NLP metrics and task-specific symptom-severity accuracy are stated without accompanying quantitative values, named baselines, effect sizes, confidence intervals, or statistical tests. This omission prevents evaluation of whether the reported improvements are robust or clinically relevant, directly undermining the empirical contribution.
minor comments (1)
- [User Study] The user study is described only at a high level; adding the exact rating scales, inter-rater reliability, and any statistical comparison to baselines would strengthen the qualitative evidence without altering the core claims.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and quantitative results as outlined.
- Referee: [Dataset Construction] Dataset Construction section: The manuscript describes converting EMA symptom scores into natural-language descriptions to create the >100k training pairs but supplies no concrete details on the templates, prompting strategy, LLM used for generation, post-processing rules, or any validation (e.g., clinician review or inter-rater agreement) that the resulting texts faithfully capture symptom phenomenology without introducing artifacts or bias. Because the patch encoder is trained exclusively on these pairs and all reported gains in symptom-severity accuracy and narrative quality rest on this alignment, the absence of such documentation leaves the central technical claim unsupported.
  Authors: We agree that the Dataset Construction section requires more explicit documentation to substantiate the training pairs. In the revised manuscript, we will expand this section with the specific templates for converting EMA scores to natural language, the prompting strategy and LLM used for generation, post-processing rules applied, and validation procedures including any clinician review or inter-rater agreement metrics. These additions will directly address concerns about fidelity, bias, and support for the alignment claims.
  Revision: yes
- Referee: [Results] Results section (and abstract): The claims of outperformance on NLP metrics and task-specific symptom-severity accuracy are stated without accompanying quantitative values, named baselines, effect sizes, confidence intervals, or statistical tests. This omission prevents evaluation of whether the reported improvements are robust or clinically relevant, directly undermining the empirical contribution.
  Authors: We concur that the Results section and abstract would be strengthened by including the specific quantitative details. In the revision, we will report the exact NLP metric values and symptom-severity accuracy scores, name the baselines compared against, include effect sizes, confidence intervals, and results of statistical tests. This will enable proper assessment of robustness and clinical relevance.
  Revision: yes
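As an illustration of the kind of reporting discussed in the Results exchange above, the following is a minimal sketch of a paired percentile bootstrap for the mean per-example difference between two systems. It is not the authors' analysis: the function name `paired_bootstrap_ci`, the resample count, and the toy scores in the usage note are all assumptions made for this example.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-example difference (a - b).

    scores_a and scores_b are per-example metric values for two systems
    evaluated on the same examples (hypothetical inputs, not paper data).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally sized score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return fmean(diffs), lower, upper
```

A call such as `paired_bootstrap_ci([0.6, 0.7, 0.8, 0.9], [0.5, 0.6, 0.7, 0.8])` (toy numbers only) returns the observed mean difference together with the lower and upper bounds; an interval that excludes zero would be one way to support a claim of outperformance.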
Circularity Check
No significant circularity: empirical dataset construction and encoder training are self-contained
The paper's core pipeline constructs >100k sensor-text pairs by transforming EMA symptom scores into natural-language descriptions, then trains a patch-level encoder to project raw multimodal streams into LLM space. All reported results (NLP metrics, symptom-severity accuracy, clinician ratings) follow from standard training and evaluation on these pairs plus a separate user study. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on the external validity of the EMA-to-text transformation rather than any internal reduction to the paper's own inputs. This is the expected non-circular outcome for an applied systems paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/DimensionForcing.lean (and ArithmeticFromLogic.lean): 8-tick period forced by reality_from_one_distinction
  Tag: echoes (this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency).
  Paper passage: "the normalized sequence is divided into N=⌈T/k⌉ non-overlapping patches of width k, where k=8 for all streams in our experiments"
- IndisputableMonolith/Cost/FunctionalEquation.lean: J(x) = ½(x + x⁻¹) − 1 uniqueness and ratio symmetry
  Tag: echoes (this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency).
  Paper passage: "reversible value normalization … auxiliary statistics (µ, σ, min, max) are inserted into the textual prompt as metadata"
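The two quoted paper passages describe reversible normalization followed by patching into N=⌈T/k⌉ windows with k=8. A minimal sketch of those two steps follows, using only the Python standard library. Several details are assumptions not stated in the quoted text: z-scoring as the normalization, zero-padding of the final patch, the fallback for constant streams, and the metadata string format. The function names are hypothetical.

```python
import math
from statistics import fmean, pstdev

PATCH_WIDTH = 8  # k=8, as stated in the quoted passage


def normalize(series):
    """Z-score a series and keep the statistics needed to invert it.

    Assumption: the paper's "reversible value normalization" is sketched
    here as z-scoring; the quoted text does not give the exact form.
    """
    mu = fmean(series)
    sigma = pstdev(series)
    scale = sigma if sigma > 0 else 1.0  # assumption: constant streams map to zeros
    stats = {"mu": mu, "sigma": sigma, "min": min(series), "max": max(series)}
    return [(x - mu) / scale for x in series], stats


def denormalize(normed, stats):
    """Invert normalize() using the stored statistics."""
    scale = stats["sigma"] if stats["sigma"] > 0 else 1.0
    return [z * scale + stats["mu"] for z in normed]


def patchify(normed, k=PATCH_WIDTH, pad_value=0.0):
    """Split a sequence of length T into N = ceil(T / k) patches of width k.

    Assumption: the last patch is right-padded with pad_value when T is
    not a multiple of k.
    """
    n_patches = math.ceil(len(normed) / k)
    padded = list(normed) + [pad_value] * (n_patches * k - len(normed))
    return [padded[i * k:(i + 1) * k] for i in range(n_patches)]


def stats_to_metadata(name, stats):
    """Render (mu, sigma, min, max) as prompt metadata (hypothetical format)."""
    return (f"{name}: mean={stats['mu']:.2f}, std={stats['sigma']:.2f}, "
            f"min={stats['min']:.2f}, max={stats['max']:.2f}")
```

For a 20-sample stream, `patchify` yields ⌈20/8⌉ = 3 patches, with the last one half padded. Keeping µ, σ, min, and max alongside the patches is what lets the original scale be recovered or described in text.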
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship
PULSE demonstrates that agentic LLM-based investigation of passive smartphone sensing data achieves balanced accuracies of 0.743 (with diary) and 0.713 (sensing-only) for predicting emotion regulation desire and inter...
Reference graph
Works this paper leans on
-
[1]
Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A
The PHQ-9: validity of a brief depression severity measure. Journal of general internal medicine, 16(9):606–613. Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Rodriguez, Daniel...
-
[2]
A picture is worth a thousand numbers: Enabling LLMs reason about time series via visualization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7486–7518, Albuquerque, New Mexico. Association for Computational Lin...
-
[3]
In Proceedings of the Conference on Health, Inference, and Learning (CHIL)
Time-LLM: Time series forecasting by reprogramming large language models. Sehwan Moon, Aram Lee, Jeong Eun Kim, Hee-Ju Kang, Il-Seon Shin, Sung-Wan Kim, Jae-Min Kim, Min Jhon, and Ju-Wan Kim. 2025. Depressllm: Interpretable domain-adapted language model for depression detection from real-world narratives. Subigya Nepal, Wenjun Liu, Arvind Pillai, We...
-
[4]
Frontiers in digital health, 3:662811
Wearable, environmental, and smartphone-based passive sensing for mental health monitoring. Frontiers in digital health, 3:662811. Dimitris Spathis and Fahim Kawsar. 2024. The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models. Journal of the American Medical Informatics Association, 31(9):2151–2158....
-
[5]
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning. arXiv preprint. arXiv:2412.03104 [cs]. Xuhai Xu, Xin Liu, Han Zhang, Weichen Wang, Subigya Nepal, and 1 others. 2023. Globem: Cross-dataset generalization of longitudinal human behavior modeling. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 6(4...
-
[6]
Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou
Qwen3 technical report. Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. MentaLLaMA: Interpretable mental health analysis on social media with large language models. In Proceedings of the ACM Web Conference 2024, WWW ’24, page 4489–4500, New York, NY, USA. Association for Computing Machinery. Hyungjun Yoon,...
-
[7]
scale to assess symptoms over the preceding four-hour window. As shown in Table 2, the questionnaire covers 14 distinct categories, including core depression indicators like anhedonia and depressed mood, alongside physical and cognitive markers such as somatic discomfort, fatigue, and concentration. It also includes a negative event question to unders...
-
[8]
Anhedonia (Interest/Pleasure) In the past 4 hours, how much has the user shown little interest or pleasure in activities?
-
[9]
Depressed Mood In the past 4 hours, how much has the user appeared down, depressed, or hopeless?
-
[10]
Sleep Disturbance Last night, how much trouble did the user have with sleep?
-
[11]
Fatigue / Energy In the past 4 hours, how tired or low in energy has the user been?
-
[12]
Appetite Change In the past 4 hours, how much has the user shown a poor appetite or overeating?
-
[13]
Self-worth / Guilt In the past 4 hours, how much has the user felt bad about themselves?
-
[14]
Concentration In the past 4 hours, how much trouble has the user had concentrating?
-
[15]
Psychomotor Change In the past 4 hours, how much has the user been moving or speaking more slowly than usual?
-
[16]
Suicidal Ideation In the past 4 hours, how often has the user had thoughts of harming themselves or wishing to be dead?
-
[17]
Somatic Discomfort In the past 4 hours, how much has the user experienced headache, abdominal discomfort, or body aches?
-
[18]
Inverted Question An inverted question randomized from Q1, Q4, or Q7
-
[19]
Anxiety Arousal In the past 4 hours, how much has the user felt nervous, anxious, or on edge?
-
[20]
Uncontrollable Worry In the past 4 hours, how much has the user been unable to stop or control worrying?
-
[21]
Negative Event In the past 4 hours, did the user experience a negative event? If yes: How negative was the event?
Overall Summary Please summarize the user’s overall mental and physical state in the past 4 hours, integrating mood, energy, sleep, appetite, concentration, and physical symptoms.
Table 2: EMA Questions. Categories of depression- and anxiety-rel...
-
[22]
What is your professional role or background? (Options: Psychiatrist, Clinical Psychologist, Therapist, Other)
-
[23]
What is your highest degree held? (Options: High school/Diploma, Bachelor’s, Master’s, Doctorate/MD)
-
[24]
How familiar are you with the Patient Health Questionnaire-9 (PHQ-9) for depression screening?
-
[25]
How familiar are you with the Generalized Anxiety Disorder Questionnaire (GAD-7) for anxiety screening? For Questions 3 and 4, the response options were: Not familiar at all, Slightly familiar, Moderately familiar, and Very familiar. The experts’ pre-survey responses are summarized in Table 7. Survey Flow. After completing the pre-survey questions, each ex...
-
[26]
Anhedonia (loss of interest or pleasure)
-
[27]
If a symptom is not present in a text, you must set both presence and severity to 0
OverallSeverity Severity scale is ordinal and must be inferred from the overall semantic strength of the description. If a symptom is not present in a text, you must set both presence and severity to 0. User Prompt Template Reference Summary: {reference} Prediction Summary: {prediction} Structured Response Schema SymptomEvaluation object with 14 symptom f...
-
[33]
Phone unlock status: number of unlock events per minute, representing cognitive or social engagement. Additional contextual features: {contextual_features} Your task: Analyze the figure carefully and provide a clinical summary of the user’s recent psychological and behavioral state, as if you were writing a short report based on PHQ-9–related observati...
-
[34]
Heart rate (bpm): indicator of arousal, stress, and autonomic balance
-
[35]
Pseudoactigraphy: derived from wrist accelerometer signals, representing movement intensity and rest–activity rhythm
-
[36]
Steps per minute: reflects overall mobility and engagement in physical activity
-
[37]
Stress level: Garmin HRV-based estimation of physiological stress
-
[38]
GPS coordinates (longitude and latitude): capture spatial mobility and time spent in different environments
-
[39]
Phone unlock status: number of unlock events per minute, representing cognitive or social engagement. Additional contextual features: {contextual_features} Question: {question} Your task: Analyze the figure and answer the question directly. Base your reasoning only on observable behavioral and physiological patterns plus the contextual features. Produce...
-
[46]
Phone unlock status (binary 0/1 per minute, length 240)<ts></ts> {sleep_conversation} Task: Using only the provided textual data, produce a short clinical summary (about one concise paragraph) describing the user’s psychological and physical state over the last 4 hours. Your description should resemble a human-written mental-health assessment and cover...
-
[47]
Heart rate (1 reading per minute, length 1440) <ts></ts>
-
[48]
Pseudoactigraphy (accelerometer-based movement intensity × zero-crossing rate, length 480) <ts></ts>
-
[49]
Steps per minute (length 240)<ts></ts>
-
[50]
Stress level (length 240)<ts></ts>
-
[51]
GPS longitude (length 24)<ts></ts>
-
[52]
GPS latitude (length 24)<ts></ts>
-
[53]
• Refer only to the information implied by the time-series data; do not add external facts
Phone unlock status (binary 0/1 per minute, length 240)<ts></ts> {sleep_conversation} Question: {question} Answer Requirements: • Provide a concise, clinically grounded answer in one or two sentences. • Refer only to the information implied by the time-series data; do not add external facts. • If the data is insufficient, explicitly say so. Answer: Table...
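The question-answering prompt fragments above list seven sensor streams, each followed by a `<ts></ts>` placeholder. A minimal sketch of how such a prompt might be assembled is given below. The stream names, lengths, and answer requirements are copied from the quoted fragments; the function names, the list numbering, the line layout, and the use of k=8 to count patches per stream are assumptions for illustration, and how the placeholders are later filled with encoder outputs is not shown in the quoted text.

```python
import math

PATCH_WIDTH = 8  # k=8, from the patching passage quoted earlier on this page
TS_PLACEHOLDER = "<ts></ts>"

# (name, detail, length) as listed in the quoted prompt fragments.
STREAMS = [
    ("Heart rate", "1 reading per minute", 1440),
    ("Pseudoactigraphy", "accelerometer-based movement intensity × zero-crossing rate", 480),
    ("Steps per minute", "", 240),
    ("Stress level", "", 240),
    ("GPS longitude", "", 24),
    ("GPS latitude", "", 24),
    ("Phone unlock status", "binary 0/1 per minute", 240),
]

REQUIREMENTS = [
    "Provide a concise, clinically grounded answer in one or two sentences.",
    "Refer only to the information implied by the time-series data; do not add external facts.",
    "If the data is insufficient, explicitly say so.",
]


def build_qa_prompt(data, question, sleep_conversation=""):
    """Assemble a QA prompt with one placeholder per stream (hypothetical layout)."""
    lines = []
    for index, (name, detail, length) in enumerate(STREAMS, start=1):
        if len(data[name]) != length:
            raise ValueError(f"{name}: expected {length} samples, got {len(data[name])}")
        note = f"{detail}, length {length}" if detail else f"length {length}"
        lines.append(f"{index}. {name} ({note}) {TS_PLACEHOLDER}")
    if sleep_conversation:
        lines.append(sleep_conversation)
    lines.append(f"Question: {question}")
    lines.append("Answer Requirements:")
    lines.extend(f"• {requirement}" for requirement in REQUIREMENTS)
    lines.append("Answer:")
    return "\n".join(lines)


def patch_counts(k=PATCH_WIDTH):
    """Number of width-k patches each stream would contribute: ceil(length / k)."""
    return {name: math.ceil(length / k) for name, _, length in STREAMS}
```

Under these assumptions the heart-rate stream would contribute 1440/8 = 180 patches and each GPS stream 24/8 = 3, which gives a rough sense of how many sensor positions stand behind each placeholder.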