EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

Gregory Kondas; Isaac Kohane; Liat Antwarg Friedman; Matthew McDermott; Payal Chandak

arxiv: 2603.07900 · v2 · pith:MMXPH5K7new · submitted 2026-03-09 · 💻 cs.AI

Payal Chandak, Gregory Kondas, Liat Antwarg Friedman, Isaac Kohane, Matthew McDermott

Pith reviewed 2026-05-21 12:19 UTC · model grok-4.3

keywords: zero-shot clinical prediction, electronic health records, foundation models, task-conditioned pretraining, clinical tasks, autoregressive baseline, MIMIC-IV

The pith

EveryQuery pretrains EHR models on random task queries to deliver direct zero-shot clinical predictions without generating trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EveryQuery as a foundation model for electronic health records that learns to answer arbitrary clinical prediction questions directly from patient history. By conditioning pretraining on randomly sampled combinations of structured queries and patient contexts, the model estimates the likelihood of future outcomes in a single forward pass. This replaces the need to generate multiple synthetic patient futures and aggregate statistics over them. A sympathetic reader would care because the method shows higher accuracy than autoregressive sampling on most tested tasks and especially improves results for rare clinical events.

Core claim

EveryQuery achieves zero-shot inference through task-conditioned pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts and enabling direct likelihood estimation for any task in the query space without finetuning, linear probing, or trajectory generation.

What carries the argument

Task-conditioned pretraining on randomly sampled query tasks paired with patient contexts, which trains the model to output outcome likelihoods directly from history plus query.
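
One way to picture this sampling procedure is the minimal sketch below. It is not the authors' implementation: the abstract only says that a query specifies a clinical task and a future window, so the `Query` fields (a single code plus a horizon in days), the `Event` type, the code strings, and the helper names are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    day: int   # days since the patient's first record
    code: str  # hypothetical clinical code string


@dataclass(frozen=True)
class Query:
    code: str          # assumed query form: "does this code occur..."
    horizon_days: int  # "...within this many days after the index time?"


def label(events, index_day, query):
    """Return 1 if query.code occurs in (index_day, index_day + horizon], else 0."""
    end = index_day + query.horizon_days
    return int(any(e.code == query.code and index_day < e.day <= end for e in events))


def sample_example(patients, vocab, horizons, rng):
    """Draw one random (history, query, label) pretraining example."""
    events = rng.choice(patients)
    index_day = rng.choice(events).day            # random prediction time
    query = Query(rng.choice(vocab), rng.choice(horizons))
    history = [e for e in events if e.day <= index_day]  # model sees only the past
    return history, query, label(events, index_day, query)


# Toy records, invented for this sketch (not MIMIC-IV data).
patients = [
    [Event(0, "DX//sepsis"), Event(3, "LAB//lactate_high"), Event(40, "DX//aki")],
    [Event(0, "DX//copd"), Event(10, "DX//pneumonia")],
]
vocab = sorted({e.code for p in patients for e in p})
rng = random.Random(0)
batch = [sample_example(patients, vocab, [7, 30, 365], rng) for _ in range(4)]
```

Under these assumptions, a model trained on such triples with a binary cross-entropy loss would learn to map (history, query) to an outcome probability, which is the single-forward-pass behaviour the paper describes.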

If this is right

Outperforms an autoregressive baseline on 82 percent of 39 randomly sampled prediction tasks.
Delivers a mean AUC improvement of +0.16 with 95 percent confidence interval [0.10, 0.22].
Maintains its performance advantage on tasks explicitly held out from the pre-training distribution.
Shows the largest gains on rare clinical events where trajectory sampling is statistically noisy.
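
The rare-event point can be made concrete with a small calculation. The sketch below assumes a simplified model that is not taken from the paper: each sampled trajectory independently contains the event with true probability `p`, and the autoregressive estimate is the fraction of `n_trajectories` samples that contain it. The value `n = 20` is an arbitrary illustration.

```python
import math


def miss_probability(p, n_trajectories):
    """Chance that none of the sampled trajectories contains the event."""
    return (1.0 - p) ** n_trajectories


def standard_error(p, n_trajectories):
    """Standard error of the sampled-fraction estimate of p (binomial)."""
    return math.sqrt(p * (1.0 - p) / n_trajectories)


n = 20  # illustrative sample count, not a setting reported in the paper
for p in (0.2, 0.01):
    relative_se = standard_error(p, n) / p
    print(f"p={p}: P(all samples miss)={miss_probability(p, n):.3f}, "
          f"relative SE={relative_se:.2f}")
```

In this simplified setting the relative error grows as the event gets rarer, and for very rare events most patients receive an estimate of exactly zero, which removes the ranking signal that AUC depends on.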

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Clinicians could pose ad-hoc questions to an EHR model without needing to retrain or fine-tune for each new prediction target.
Extending the query language to handle logical disjunctions would likely close the gap on tasks such as readmission prediction.
The direct-likelihood approach may remain computationally cheaper than trajectory sampling even as the number of possible clinical queries grows large.

Load-bearing premise

The structured query language used during pretraining can express the full space of clinically relevant tasks.

What would settle it

A direct test showing EveryQuery underperforms the autoregressive baseline on 30-day readmission or any other task that requires disjunctive reasoning over multiple diagnosis codes.

Original abstract

Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EveryQuery pretrains EHR models for direct query likelihoods instead of trajectory sampling and shows gains on most tasks, but the query language cannot handle disjunctive cases like readmission.

The main thing to know is that this paper trains an EHR model to take a patient history plus a structured query and output a probability for the clinical outcome in a single forward pass. That replaces the usual autoregressive route of sampling many future trajectories and counting events, which is slow and noisy, especially for rare outcomes. The authors pretrain by sampling random query-patient pairs, so the model learns to answer arbitrary prompts without later fine-tuning or probing.

On MIMIC-IV this beats the autoregressive baseline on 82% of 39 tasks with a mean AUC gain of 0.16, and the lift is biggest for low-prevalence events; the advantage also holds on tasks held out from pretraining. Those numbers are the concrete evidence the paper provides.

The paper is clear that the current query language cannot express disjunctive conditions, so the model underperforms on 30-day readmission and similar tasks that are satisfied by any of several codes. That limitation is real and directly restricts how many standard clinical predictions can be handled zero-shot right now. The stress-test note flags exactly this coverage gap, and the abstract confirms it rather than glossing over it.

For readers building practical zero-shot clinical tools or working on EHR foundation models, the work is worth a look because it tests a different inference setup with benchmark results that can be checked. The citation pattern is light and focused on the autoregressive baselines it compares against. I would send it for peer review because the empirical comparison is there, the limitation is stated plainly, and the central idea can be evaluated on its own terms even if the query expressiveness needs more work.

Referee Report

1 major / 1 minor

Summary. The paper introduces EveryQuery, an EHR foundation model that achieves zero-shot clinical prediction via task-conditioned pretraining on randomly sampled combinations of structured queries and patient contexts from electronic health records. Instead of generating synthetic futures and aggregating trajectories, the model takes a patient's history plus a query specifying a clinical task and directly estimates the likelihood of the outcome in a single forward pass. On MIMIC-IV it reports outperformance versus an autoregressive baseline on 82% of 39 tasks (mean AUC gain +0.16, 95% CI [0.10,0.22]), with gains persisting on held-out tasks and being largest for rare events; the abstract itself flags underperformance on disjunctive tasks such as 30-day readmission.
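
For readers who want to check how a win rate and a confidence interval on a mean AUC difference can be derived from per-task results, here is a minimal sketch using a percentile bootstrap over tasks. The abstract does not say how the reported interval was computed, so the bootstrap is an assumption, and the numbers in `toy_deltas` are invented for illustration and are not the paper's data.

```python
import random
import statistics


def bootstrap_mean_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `deltas`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))  # resample tasks
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-task AUC differences (direct model minus autoregressive baseline).
toy_deltas = [0.21, 0.05, -0.03, 0.30, 0.12, 0.18, -0.01, 0.25]

win_rate = sum(d > 0 for d in toy_deltas) / len(toy_deltas)
mean_delta = statistics.fmean(toy_deltas)
ci_low, ci_high = bootstrap_mean_ci(toy_deltas)
print(f"win rate={win_rate:.2f}, mean delta={mean_delta:+.3f}, "
      f"95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling at the task level treats the evaluated tasks as a random draw from the query space, which matches the paper's description of the 39 tasks as randomly sampled.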

Significance. If the empirical results hold after addressing the noted limitation, EveryQuery would offer a computationally lighter and directly promptable alternative to trajectory-sampling approaches for zero-shot EHR prediction. Concrete outperformance numbers with confidence intervals, explicit testing on held-out tasks, and the demonstration of gains on low-prevalence outcomes constitute clear strengths that could influence future foundation-model design in clinical ML.

major comments (1)

[Abstract] The central claim that EveryQuery enables zero-shot prediction for arbitrary clinical tasks within the query space is undercut by the explicit statement that the model underperforms on tasks requiring disjunctive reasoning over multiple codes (e.g., 30-day readmission). Because many standard clinical endpoints are disjunctive by nature, this limitation directly affects the asserted coverage of clinically relevant tasks and should be treated as a scope restriction rather than a minor caveat.

minor comments (1)

The manuscript would benefit from an explicit definition or grammar of the structured query language (including supported operators and how disjunction is or is not represented) so that readers can assess the precise fragment of clinical tasks that fall inside the supported space.
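
To illustrate what such a grammar would need to distinguish, the sketch below contrasts a single-code query with an any-of query. Both classes, the `occurs` helper, and the admission code strings are hypothetical; the abstract states only that tasks such as 30-day readmission require disjunctive reasoning over multiple codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeQuery:
    code: str          # assumed single-code query form
    horizon_days: int


@dataclass(frozen=True)
class AnyOfQuery:
    codes: frozenset   # hypothetical disjunction: any of these codes counts
    horizon_days: int


def occurs(future_events, query):
    """future_events: list of (days_after_index, code) tuples."""
    in_window = {c for d, c in future_events if 0 < d <= query.horizon_days}
    if isinstance(query, CodeQuery):
        return query.code in in_window
    return bool(query.codes & in_window)


# Invented future events for one patient.
future = [(12, "ADMIT//emergency"), (45, "ADMIT//elective")]
readmit = AnyOfQuery(
    frozenset({"ADMIT//emergency", "ADMIT//elective", "ADMIT//urgent"}), 30
)
singles = [CodeQuery(c, 30) for c in sorted(readmit.codes)]

print(occurs(future, readmit), [occurs(future, q) for q in singles])
```

For a single patient the any-of label is the logical OR of the single-code labels, but the corresponding probabilities do not combine so simply: without knowing how the codes co-occur, P(any) is only bounded between the largest single-code probability and their sum. A language restricted to single-code queries therefore cannot recover it exactly.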

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments and constructive feedback on our manuscript. We address the major comment below.

Referee: [Abstract] The central claim that EveryQuery enables zero-shot prediction for arbitrary clinical tasks within the query space is undercut by the explicit statement that the model underperforms on tasks requiring disjunctive reasoning over multiple codes (e.g., 30-day readmission). Because many standard clinical endpoints are disjunctive by nature, this limitation directly affects the asserted coverage of clinically relevant tasks and should be treated as a scope restriction rather than a minor caveat.

Authors: We agree that the limitation on disjunctive tasks is important and should be presented as a scope restriction. The manuscript already explicitly states this limitation in the abstract to maintain transparency about the expressiveness of the current query language. In response to this comment, we will revise the abstract to more prominently frame the zero-shot prediction capability as applying to tasks within the query space that the model can effectively handle, while clearly positioning the underperformance on disjunctive reasoning tasks (e.g., 30-day readmission) as a scope limitation rather than a minor caveat. This will better reflect the coverage of clinically relevant tasks and highlight areas for future improvement in query expressiveness. We believe this revision addresses the concern without altering the core contributions.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical pretraining procedure for an EHR foundation model that directly trains on randomly sampled query-patient pairs to enable zero-shot likelihood estimation. No mathematical derivation chain, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims rest on external benchmark comparisons against an autoregressive baseline on MIMIC-IV tasks, including held-out ones, rather than any reduction of outputs to inputs by construction. The noted limitation on disjunctive queries is a scope issue for the query language, not a circularity in the method's justification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes the query language and pretraining distribution are representative of real clinical tasks.
