pith. machine review for the scientific record.

arxiv: 2603.23987 · v2 · submitted 2026-03-25 · 💻 cs.LG

Recognition: no theorem link

Can we generate portable representations for clinical time series data using LLMs?

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords portable representations · clinical time series · LLM summarization · hospital transfer · ICU predictions · distribution shift · patient embeddings · few-shot learning

The pith

Frozen LLMs can turn irregular ICU time series into portable patient embeddings that transfer across hospitals with smaller performance drops.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can produce patient representations from irregular clinical time series that work when moved between hospitals without retraining. It converts time series into natural language summaries with a frozen LLM, then embeds the summaries into fixed-length vectors using a frozen text model for use in forecasting and classification tasks. This method matches the accuracy of grid imputation, self-supervised learning, and time series foundation models within one hospital while showing reduced degradation on data from other hospitals across three cohorts (MIMIC-IV, HIRID, and PPICU). The results indicate that structured prompts stabilize predictions and that the embeddings support few-shot learning without increasing demographic recoverability.

Core claim

Mapping irregular ICU time series to concise natural language summaries with a frozen LLM, then embedding those summaries into fixed-length vectors with a frozen text model, yields patient representations that support downstream predictors. These vectors achieve competitive results on forecasting and classification tasks within a single hospital and exhibit smaller relative performance drops when applied to new hospitals compared to grid imputation, self-supervised representation learning, and time series foundation models.
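The transfer comparison above hinges on the relative performance drop. The paper's exact normalization is not quoted here, so the definition below is an assumption, but the usual form for a higher-is-better metric such as AUROC is:

```python
def relative_drop(in_dist_score, transfer_score):
    """Relative performance drop when a model trained at one hospital
    is evaluated at another. Assumes a higher-is-better metric (e.g. AUROC);
    the paper's exact normalization may differ."""
    return (in_dist_score - transfer_score) / in_dist_score

# A method falling from 0.85 AUROC in-distribution to 0.80 out-of-distribution
# degrades less, in relative terms, than one falling from 0.88 to 0.76.
print(relative_drop(0.85, 0.80))
print(relative_drop(0.88, 0.76))
```

On this definition, "smaller relative drops" means the second number shrinks even if the in-distribution scores are comparable.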

What carries the argument

Two-stage pipeline that uses a frozen LLM to generate natural language summaries of irregular time series, followed by a frozen text embedding model to produce fixed-length vectors.
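The two stages can be sketched as follows. Both stage implementations are stand-ins: the paper uses a frozen LLM for summarization and a frozen text-embedding model for vectorization, whereas the template and hash-based embedding below only mimic the interfaces and output shapes.

```python
import hashlib
import math

def summarize(events):
    """Stage 1 stand-in: turn irregular (time, variable, value) events into
    a concise natural-language summary. The paper uses a frozen LLM here."""
    parts = [f"At hour {t:.1f}, {var} was {val}" for t, var, val in events]
    return "ICU course: " + "; ".join(parts) + "."

def embed(text, dim=8):
    """Stage 2 stand-in: map the summary to a fixed-length unit vector.
    The paper uses a frozen text-embedding model; hashing only mimics
    the fixed-dimensional output."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.sha256(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

events = [(0.0, "heart rate", 92), (3.5, "lactate", 2.1)]
vector = embed(summarize(events))
assert len(vector) == 8  # fixed length regardless of series irregularity
```

The key property the sketch preserves is that an irregular, variable-length series always yields a fixed-length vector usable by any downstream predictor.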

If this is right

  • Competitive in-distribution performance with grid imputation, self-supervised representation learning, and time series foundation models on forecasting and classification tasks.
  • Smaller relative performance drops when transferring to new hospitals across three cohorts.
  • Structured prompts reduce variance in predictive models without altering mean accuracy.
  • Improved results in few-shot learning scenarios.
  • No increase in recoverability of age or sex relative to baselines.
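The last point is checkable with a simple probe: if a classifier cannot predict sex (or binned age) from the embeddings much better than chance, the representation adds little demographic leakage. The nearest-centroid probe below is a hypothetical stand-in for whatever probe the paper actually uses.

```python
def nearest_centroid_accuracy(embeddings, labels):
    """Accuracy of a nearest-centroid probe predicting `labels` from
    `embeddings`; near-chance accuracy suggests low recoverability."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [e for e, l in zip(embeddings, labels) if l == c]
        dim = len(members[0])
        centroids[c] = [sum(m[i] for m in members) / len(members) for i in range(dim)]
    correct = 0
    for e, l in zip(embeddings, labels):
        pred = min(classes, key=lambda c: sum((a - b) ** 2 for a, b in zip(e, centroids[c])))
        correct += pred == l
    return correct / len(labels)

# Toy check: labels uncorrelated with the vectors sit at chance level.
emb = [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2], [0.1, 0.2]]
print(nearest_centroid_accuracy(emb, ["F", "M", "F", "M"]))  # 0.5
```

The paper's claim is that this kind of probe does no better on its embeddings than on the baselines', not that it fails absolutely.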

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hospitals could generate embeddings locally from their own data while using a shared downstream predictor trained elsewhere.
  • The summarization approach might extend to other irregular time series domains such as sensor data in engineering or environmental monitoring.
  • Further prompt optimization tailored to specific clinical outcomes could improve summary quality and prediction reliability.
  • The method lowers engineering overhead for multi-site model deployment by reducing the need for site-specific feature engineering.

Load-bearing premise

The natural language summaries produced by the frozen LLM must faithfully preserve clinically relevant information from the original time series without systematic loss or bias that would affect downstream predictions.
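A minimal way to bound that information loss is to fit the same downstream predictor once on raw time-series features and once on summary-derived embeddings, and read the score gap as an upper bound on what summarization discards. The harness below is a sketch; `fit_predict` and the trivial majority-class predictor are hypothetical stand-ins for a real model and metric.

```python
def information_loss_gap(raw_X, summary_X, y, fit_predict):
    """Score gap between the same predictor on raw features vs. summary
    embeddings; a positive gap suggests the summaries lose signal."""
    raw_score = fit_predict(raw_X, y)
    summary_score = fit_predict(summary_X, y)
    return raw_score - summary_score

def majority_baseline(X, y):
    # Trivial stand-in predictor: accuracy of always guessing the mode of y.
    mode = max(set(y), key=y.count)
    return sum(label == mode for label in y) / len(y)

gap = information_loss_gap([[1], [2]], [[0.1], [0.2]], [0, 0], majority_baseline)
print(gap)  # 0.0 for this degenerate stand-in; a real predictor would differ
```

A near-zero gap with a strong predictor would support the premise; a large one would suggest transfer gains come from predictor robustness rather than faithful summaries.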

What would settle it

A new hospital cohort where the embeddings show larger relative performance drops than grid imputation or self-supervised baselines on the same forecasting and classification tasks would falsify the portability advantage.

Original abstract

Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question -- can large language models (LLMs) create portable patient embeddings, i.e. representations of patients that enable a downstream predictor built at one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning? To do so, we map irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed-length vector capable of serving as input to a variety of downstream predictors. Across three cohorts (MIMIC-IV, HIRID, PPICU), on multiple clinically grounded forecasting and classification tasks, we find that our approach is simple, easy to use, and competitive in-distribution with grid imputation, self-supervised representation learning, and time series foundation models, while exhibiting smaller relative performance drops when transferring to new hospitals. We study the variation in performance across prompt designs, with structured prompts being crucial to reducing the variance of the predictive models without altering mean accuracy. We find that using these portable representations improves few-shot learning and does not increase demographic recoverability of age or sex relative to baselines, suggesting little additional privacy risk. Our work points to the potential that LLMs hold as tools to enable the scalable deployment of production-grade predictive models by reducing the engineering overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes mapping irregular ICU time series (vitals, labs, interventions) from cohorts including MIMIC-IV, HIRID, and PPICU to concise natural language summaries via a frozen LLM, followed by embedding with a frozen text model to produce fixed-length vectors for downstream forecasting and classification predictors. It reports that these representations achieve competitive in-distribution performance against grid imputation, self-supervised representation learning, and time series foundation models, with smaller relative drops under hospital transfer, improved few-shot learning, reduced variance from structured prompts, and no increase in demographic recoverability.

Significance. If the empirical results hold under fuller verification, the work offers a low-overhead route to portable clinical representations that could reduce retraining costs when deploying models across institutions; the emphasis on frozen components and prompt structure is a practical strength for reproducibility.

major comments (2)
  1. [Methods] The portability claim rests on the assumption that frozen-LLM summaries preserve clinically relevant temporal and missingness patterns without systematic bias or loss. No direct fidelity check (reconstruction error, clinician-rated completeness, or comparison of summary-derived vs. raw-series information content) is described, which is load-bearing because downstream transfer gains could arise from predictor robustness rather than faithful representations.
  2. [Experiments] Results on transfer performance: the smaller relative drops versus baselines are central to the main claim, yet the manuscript provides only high-level statements without per-task metrics, confidence intervals, or statistical tests for the cross-cohort differences, making it impossible to judge whether the improvement is reliable or practically meaningful.
minor comments (2)
  1. [Prompt Design] Prompt variation study: while structured prompts are noted to reduce variance, an explicit ablation table showing component contributions (e.g., inclusion of missingness flags, temporal ordering) would clarify the source of stability.
  2. [Few-shot Experiments] Few-shot results: the reported improvement is promising but would benefit from reporting the exact number of shots and the baseline few-shot performance for direct comparison.
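As an illustration of the kind of ablation the first minor comment asks for, a structured prompt with explicit temporal ordering and missingness flags might look like the template below. The sections and field names are hypothetical, not the paper's actual template.

```python
def build_structured_prompt(patient):
    """Build a structured summarization prompt with chronological ordering
    and explicit missingness flags (hypothetical template)."""
    lines = [
        "Summarize this ICU stay for a downstream risk model.",
        "",
        "Observations (chronological):",
    ]
    for t, var, val in sorted(patient["events"], key=lambda e: e[0]):
        lines.append(f"- hour {t:g}: {var} = {val}")
    measured = {var for _, var, _ in patient["events"]}
    missing = sorted(set(patient["expected"]) - measured)
    lines.append("")
    lines.append("Never measured: " + (", ".join(missing) if missing else "none"))
    return "\n".join(lines)

patient = {
    "events": [(6, "lactate", 2.1), (0, "heart rate", 92)],
    "expected": ["heart rate", "lactate", "creatinine"],
}
print(build_structured_prompt(patient))
```

An ablation would then toggle the ordering and the "Never measured" section independently to attribute the variance reduction the paper reports.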

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas where additional evidence and detail can strengthen the manuscript's support for the portability claims. We address each major comment below and will incorporate revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Methods] The portability claim rests on the assumption that frozen-LLM summaries preserve clinically relevant temporal and missingness patterns without systematic bias or loss. No direct fidelity check (reconstruction error, clinician-rated completeness, or comparison of summary-derived vs. raw-series information content) is described, which is load-bearing because downstream transfer gains could arise from predictor robustness rather than faithful representations.

    Authors: We agree that a direct fidelity assessment would provide stronger grounding for interpreting why the LLM-based representations transfer better. Our primary evidence for portability is the empirical reduction in performance degradation under hospital transfer, which we view as the most relevant metric for the deployment question posed in the paper. To address the concern, we will add a new subsection with two analyses: (1) a quantitative comparison of downstream predictor performance when trained on raw time-series features versus LLM summaries (to bound information loss), and (2) a small-scale clinician review of summary completeness for randomly sampled patients. These additions will be reported in the revised Methods and Results. revision: yes

  2. Referee: [Experiments] Results on transfer performance: the smaller relative drops versus baselines are central to the main claim, yet the manuscript provides only high-level statements without per-task metrics, confidence intervals, or statistical tests for the cross-cohort differences, making it impossible to judge whether the improvement is reliable or practically meaningful.

    Authors: We acknowledge that the current presentation of transfer results is too aggregated. In the revision we will expand the Experiments section with full per-task tables showing absolute and relative performance for every forecasting and classification task across all source-target cohort pairs. Each entry will include 95% bootstrap confidence intervals and the results of paired statistical tests (Wilcoxon signed-rank) comparing relative drops between our method and each baseline. We will also add a short discussion of effect sizes to clarify practical significance. revision: yes
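The promised 95% bootstrap confidence intervals can be sketched in a few lines; a paired Wilcoxon signed-rank test (e.g. `scipy.stats.wilcoxon`) would complement the interval but is not reimplemented here. The sample differences are illustrative, not the paper's numbers.

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Paired per-task differences: baseline relative drop minus method relative drop.
diffs = [0.04, 0.06, 0.03, 0.08, 0.05, 0.02]
print(bootstrap_ci(diffs))  # an interval above 0 supports a real transfer advantage
```

Reporting the interval alongside the test statistic addresses both the reliability and the practical-significance halves of the referee's comment.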

Circularity Check

0 steps flagged

No circularity: empirical method evaluated on held-out multi-cohort data

Full rationale

The paper proposes and empirically tests a pipeline that converts irregular ICU time series into frozen-LLM natural-language summaries, embeds them with a frozen text model, and feeds the resulting vectors to downstream predictors. All claims rest on direct performance comparisons (in-distribution and transfer) against external baselines (grid imputation, self-supervised representations, time-series foundation models) across three distinct cohorts (MIMIC-IV, HIRID, PPICU) on forecasting and classification tasks. No equations, derivations, fitted-parameter renamings, or self-citation chains appear; the portability result is measured by relative performance drop on held-out hospital data and is therefore externally falsifiable. The pipeline assigns a circularity score of 1.0, with no self-referential reductions flagged.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM summaries capture predictive clinical signals from time series; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Frozen LLMs can produce clinically faithful natural language summaries from irregular time series data.
    Invoked when mapping time series to summaries as the core representation step.

pith-pipeline@v0.9.0 · 5563 in / 1206 out tokens · 35957 ms · 2026-05-15T00:47:23.988374+00:00 · methodology

discussion (0)
