EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records
Pith reviewed 2026-05-21 12:19 UTC · model grok-4.3
The pith
EveryQuery pretrains EHR models on random task queries to deliver direct zero-shot clinical predictions without generating trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EveryQuery achieves zero-shot inference through task-conditioned pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts and enabling direct likelihood estimation for any task in the query space without finetuning, linear probing, or trajectory generation.
What carries the argument
Task-conditioned pretraining on randomly sampled query tasks paired with patient contexts, which trains the model to output outcome likelihoods directly from history plus query.
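A minimal sketch of how one task-conditioned training example might be assembled, assuming a query is a single (code, horizon) pair and a patient record is a time-sorted list of events. The names `Event`, `Query`, `make_example`, the code strings, and the day-based horizon are all illustrative assumptions, not the paper's actual schema.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    day: float   # time since the start of the record, in days (assumed unit)
    code: str    # clinical code identifier (hypothetical strings below)

@dataclass(frozen=True)
class Query:
    code: str            # hypothetical: code whose future occurrence is asked about
    horizon_days: float  # hypothetical: length of the future window

def make_example(events, vocabulary, horizons, rng):
    """Sample one (history, query, label) triple from a single patient record."""
    # Pick a random prediction time strictly inside the record.
    cut = rng.randrange(1, len(events))
    history, future = events[:cut], events[cut:]
    t0 = history[-1].day
    # Sample the query independently of the patient context.
    query = Query(code=rng.choice(vocabulary), horizon_days=rng.choice(horizons))
    # Label: does the queried code occur in the window after the prediction time?
    label = int(any(
        e.code == query.code and t0 < e.day <= t0 + query.horizon_days
        for e in future
    ))
    return history, query, label

record = [Event(0, "ADMIT"), Event(1, "LAB_K"), Event(3, "DX_AKI"), Event(10, "DISCHARGE")]
rng = random.Random(0)
history, query, label = make_example(record, ["DX_AKI", "LAB_K", "DX_SEPSIS"], [7.0, 30.0], rng)
```

Under this reading, the model would be trained to map `(history, query)` to `label` with a binary loss, so that at inference time any query in the same space can be scored in one forward pass.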
If this is right
- Outperforms an autoregressive baseline on 82 percent of 39 randomly sampled prediction tasks.
- Delivers a mean AUC improvement of +0.16 with 95 percent confidence interval [0.10, 0.22].
- Maintains its performance advantage on tasks explicitly held out from the pre-training distribution.
- Shows the largest gains on rare clinical events where trajectory sampling is statistically noisy.
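A back-of-the-envelope calculation suggests why trajectory sampling is noisy for rare events. If an autoregressive model samples N independent futures and the event occurs in each with probability p, the estimate is a binomial proportion with standard error sqrt(p(1-p)/N), and with probability (1-p)^N none of the sampled futures contains the event at all. The values of p and N below are illustrative assumptions, not figures from the paper.

```python
import math

def mc_standard_error(p, n):
    """Standard error of a proportion estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

def prob_no_hits(p, n):
    """Probability that none of n sampled trajectories contains the event."""
    return (1.0 - p) ** n

n = 20  # assumed number of sampled trajectories per patient
for p in (0.2, 0.01):
    se = mc_standard_error(p, n)
    print(f"p={p}: SE={se:.3f}, SE/p={se / p:.2f}, P(no hits)={prob_no_hits(p, n):.2f}")
```

With these illustrative numbers, a 1% event is missed entirely for roughly 82% of patients (0.99^20 ≈ 0.82). Those patients all receive the same estimate of zero and cannot be ranked against one another, which is one plausible reason AUC suffers for low-prevalence outcomes.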
Where Pith is reading between the lines
- Clinicians could pose ad-hoc questions to an EHR model without needing to retrain or fine-tune for each new prediction target.
- Extending the query language to handle logical disjunctions would likely close the gap on tasks such as readmission prediction.
- The direct-likelihood approach may remain computationally cheaper than trajectory sampling even as the number of possible clinical queries grows large.
Load-bearing premise
The structured query language used during pretraining can express the full space of clinically relevant tasks.
What would settle it
A direct test showing EveryQuery underperforms the autoregressive baseline on 30-day readmission or any other task that requires disjunctive reasoning over multiple diagnosis codes.
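One way to see why disjunction is not trivially recoverable from single-code queries: the probability of "A or B" depends on how the events co-occur, which the per-code probabilities alone do not determine. The sketch below computes the bounds implied by the marginals and the value obtained under an independence assumption. The probabilities are illustrative, not from the paper, and the function names are hypothetical.

```python
def union_bounds(probs):
    """Bounds on P(at least one event) given only the marginal probabilities."""
    lower = max(probs)            # attained when the events are fully nested
    upper = min(1.0, sum(probs))  # attained when the events are mutually exclusive
    return lower, upper

def union_if_independent(probs):
    """Noisy-OR combination; exact only if the events are independent."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

per_code = [0.10, 0.08, 0.05]  # assumed single-code predictions for one patient
lower, upper = union_bounds(per_code)
indep = union_if_independent(per_code)
print(f"union lies in [{lower:.2f}, {upper:.2f}]; independence gives {indep:.4f}")
```

Because the true value can fall anywhere in that interval, post-hoc combination of single-code outputs is only a heuristic. A query language with a native disjunction operator would let the model learn the joint behaviour directly.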
Original abstract
Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EveryQuery, an EHR foundation model that achieves zero-shot clinical prediction via task-conditioned pretraining on randomly sampled combinations of structured queries and patient contexts from electronic health records. Instead of generating synthetic futures and aggregating over trajectories, the model takes a patient's history plus a query specifying a clinical task and directly estimates the likelihood of the outcome in a single forward pass. On MIMIC-IV, the paper reports that EveryQuery outperforms an autoregressive baseline on 82% of 39 tasks (mean AUC gain +0.16, 95% CI [0.10, 0.22]), that the gains persist on held-out tasks, and that they are largest for rare events. The abstract itself flags underperformance on disjunctive tasks such as 30-day readmission.
Significance. If the empirical results hold after addressing the noted limitation, EveryQuery would offer a computationally lighter and directly promptable alternative to trajectory-sampling approaches for zero-shot EHR prediction. Concrete outperformance numbers with confidence intervals, explicit testing on held-out tasks, and the demonstration of gains on low-prevalence outcomes constitute clear strengths that could influence future foundation-model design in clinical ML.
Major comments (1)
- [Abstract] The central claim that EveryQuery enables zero-shot prediction for arbitrary clinical tasks within the query space is undercut by the explicit statement that the model underperforms on tasks requiring disjunctive reasoning over multiple codes (e.g., 30-day readmission). Because many standard clinical endpoints are disjunctive by nature, this limitation directly affects the asserted coverage of clinically relevant tasks and should be treated as a scope restriction rather than a minor caveat.
Minor comments (1)
- The manuscript would benefit from an explicit definition or grammar of the structured query language (including supported operators and how disjunction is or is not represented) so that readers can assess the precise fragment of clinical tasks that fall inside the supported space.
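To make the request concrete, here is a hypothetical sketch of what an explicit query grammar with a disjunction operator could look like. Nothing here is taken from the paper: the node types `Code` and `AnyOf`, the `Query` fields, and the code strings are assumptions chosen only to show how a disjunctive endpoint such as readmission might be expressed and labeled.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Code:
    name: str  # atomic query: a single clinical code

@dataclass(frozen=True)
class AnyOf:
    options: Tuple["Expr", ...]  # disjunction over sub-expressions

Expr = Union[Code, AnyOf]

@dataclass(frozen=True)
class Query:
    target: Expr
    horizon_days: float  # assumed: future window measured in days

def matches(expr, code):
    """True if an observed code satisfies the query expression."""
    if isinstance(expr, Code):
        return expr.name == code
    return any(matches(option, code) for option in expr.options)

def label(query, future_events):
    """future_events: iterable of (days_after_prediction_time, code) pairs."""
    return int(any(
        0 < days <= query.horizon_days and matches(query.target, code)
        for days, code in future_events
    ))

# Hypothetical readmission query: any of several admission codes within 30 days.
readmit = Query(AnyOf((Code("ADMIT_EMERGENCY"), Code("ADMIT_URGENT"))), 30.0)
print(label(readmit, [(12.0, "ADMIT_URGENT")]))
```

A grammar written down at this level of detail would let readers check which standard endpoints are expressible and which, like the `AnyOf` case here, fall outside a single-code query space.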
Simulated Author's Rebuttal
We thank the referee for their insightful comments and constructive feedback on our manuscript. We address the major comment below.
Point-by-point responses
- Referee: [Abstract] The central claim that EveryQuery enables zero-shot prediction for arbitrary clinical tasks within the query space is undercut by the explicit statement that the model underperforms on tasks requiring disjunctive reasoning over multiple codes (e.g., 30-day readmission). Because many standard clinical endpoints are disjunctive by nature, this limitation directly affects the asserted coverage of clinically relevant tasks and should be treated as a scope restriction rather than a minor caveat.
  Authors: We agree that the limitation on disjunctive tasks is important and should be presented as a scope restriction. The manuscript already explicitly states this limitation in the abstract to maintain transparency about the expressiveness of the current query language. In response to this comment, we will revise the abstract to more prominently frame the zero-shot prediction capability as applying to tasks within the query space that the model can effectively handle, while clearly positioning the underperformance on disjunctive reasoning tasks (e.g., 30-day readmission) as a scope limitation rather than a minor caveat. This will better reflect the coverage of clinically relevant tasks and highlight areas for future improvement in query expressiveness. We believe this revision addresses the concern without altering the core contributions.
  Revision: yes
Circularity Check
No significant circularity detected
The paper describes an empirical pretraining procedure for an EHR foundation model that directly trains on randomly sampled query-patient pairs to enable zero-shot likelihood estimation. No mathematical derivation chain, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims rest on external benchmark comparisons against an autoregressive baseline on MIMIC-IV tasks, including held-out ones, rather than any reduction of outputs to inputs by construction. The noted limitation on disjunctive queries is a scope issue for the query language, not a circularity in the method's justification.