pith. machine review for the scientific record.

arxiv: 2605.04071 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.AI · q-bio.QM

Recognition: unknown

FlatASCEND: Autoregressive Clinical Sequence Generation with Continuous Time Prediction and Association-Based Pharmacological Testing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.QM
keywords autoregressive clinical sequences · continuous time prediction · pharmacological association testing · patient conditioning · prompt shuffle ablation · observational data · incident user design

The pith

Patient-specific prefixes in an autoregressive clinical model amplify mechanistic drug effects while leaving confounding associations unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlatASCEND to generate multi-step clinical event sequences autoregressively, using flat composite tokens and a continuous-time prediction head. Its core test is whether conditioning on real patient prefixes makes the generated trajectories respond more strongly to known pharmacological mechanisms than to learned correlations from observational data. A prompt-shuffle ablation isolates this effect: mechanistic pairs such as steroid-glucose and diuretic-potassium show 2.0-2.2 times stronger association under patient conditioning, while a confounding pair such as insulin-glucose remains at 0.9 times. The model recovers the correct directional effect in 4 of 10 tested drug-outcome pairs on hospital records, reproduces treatment-context links in 2, and fails in 4, indicating it largely reproduces observational patterns rather than causal distinctions. Direct preference optimization on a shared outcome domain eliminates the correct associations entirely.
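To make the "flat composite token" idea concrete: instead of emitting separate tokens for an event's type and its value, each event is flattened into a single token. A minimal sketch, with hypothetical names and bin edges (the paper's actual vocabulary construction is not shown in this review):

```python
# Toy sketch of flat composite tokenization for clinical events.
# All names and bin edges here are hypothetical illustrations; the paper
# specifies only that events are encoded as "flat composite tokens".

def make_composite_token(event_type: str, value: float, bin_edges: list) -> str:
    """Flatten (event_type, value) into one composite token via value binning."""
    bin_idx = sum(value >= edge for edge in bin_edges)  # index of the value bin
    return f"{event_type}|bin{bin_idx}"

def encode_sequence(events, bin_edges):
    """Encode a list of (event_type, value) pairs as composite tokens."""
    return [make_composite_token(t, v, bin_edges) for t, v in events]

# Example: lab values binned at 70 and 180 (units hypothetical).
events = [("GLUCOSE", 95.0), ("STEROID_DOSE", 40.0), ("GLUCOSE", 210.0)]
tokens = encode_sequence(events, bin_edges=[70.0, 180.0])
```

The design choice this illustrates: a flat vocabulary keeps one event per autoregressive step, at the cost of a larger token space than a hierarchical type-then-value scheme.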

Core claim

FlatASCEND generates patient-conditioned clinical sequences whose responses to intervention tokens preserve known pharmacological associations, with a prompt-shuffle ablation demonstrating that patient-specific prefixes amplify mechanistic effects 2.0-2.2 times for steroid-to-glucose and diuretic-to-potassium while leaving confounding-driven associations at 0.9 times for insulin-to-glucose. On MIMIC-IV incident-user comparisons the model recovers correct mechanistic directions in 4 of 10 cases, reproduces context associations in 2, and produces incorrect directions in 4, a pattern the authors interpret as learned observational associations without causal separation.

What carries the argument

The prompt-shuffle ablation on patient-conditioned autoregressive generation, which isolates how specific patient history strengthens preservation of mechanistic pharmacological associations in the output trajectories.
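The ablation's logic can be sketched in a few lines, under stated assumptions: `toy_generate` stands in for the trained model, real drug exposure is a per-patient flag, and association is a simple rate difference (the paper's actual statistic is not given in this review). Shuffling breaks the patient-to-prefix pairing, so any drop in association is attributable to patient-specific conditioning:

```python
import random

def association(drug_flags, continuations, outcome_token):
    """Generated-outcome rate difference: exposed minus unexposed patients."""
    def rate(group):
        return sum(group) / len(group) if group else 0.0
    exposed = [outcome_token in c for d, c in zip(drug_flags, continuations) if d]
    unexposed = [outcome_token in c for d, c in zip(drug_flags, continuations) if not d]
    return rate(exposed) - rate(unexposed)

def shuffle_ablation(patients, generate, outcome_token, seed=0):
    """Association under true vs shuffled patient-to-prefix pairing."""
    rng = random.Random(seed)
    prefixes = [p for p, _ in patients]
    flags = [d for _, d in patients]
    a_true = association(flags, [generate(p) for p in prefixes], outcome_token)
    shuffled = prefixes[:]
    rng.shuffle(shuffled)  # break the pairing between patient and prefix
    a_shuf = association(flags, [generate(p) for p in shuffled], outcome_token)
    return a_true, a_shuf

# Toy stand-in for the model: steroid exposure in the prefix drives hyperglycemia.
def toy_generate(prefix):
    return ["HYPERGLYCEMIA"] if "STEROID" in prefix else ["NORMAL"]

patients = [(["STEROID"], True), (["STEROID"], True),
            (["DIET"], False), (["DIET"], False)]
a_true, a_shuf = shuffle_ablation(patients, toy_generate, "HYPERGLYCEMIA")
```

The reported 2.0-2.2x amplification corresponds to the ratio of the true-pairing statistic to the shuffled one; the referee's objection is about what that ratio can and cannot isolate.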

If this is right

  • Generative clinical models can be assessed by whether patient conditioning selectively boosts known mechanistic associations rather than by distributional overlap alone.
  • Optimization methods that share outcome variables with the evaluation metric can erase correct pharmacological directions.
  • Short-horizon predictions in intensive-care settings remain more reliable than longer outpatient sequences.
  • Zero-shot application across hospitals degrades without further adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the amplification effect holds on interventional data, the same conditioning approach could support in-silico simulation of personalized treatment responses.
  • Adding explicit causal structure or external knowledge graphs might help the model move beyond observational associations.
  • The finding that reward optimization destroys correct links suggests caution when aligning such generators with surrogate objectives that overlap evaluation domains.

Load-bearing premise

That an incident-user design plus prompt shuffling on observational hospital records can separate true mechanistic pharmacological responses from residual confounding and learned correlations.
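For readers unfamiliar with the design, a minimal incident-user filter might look like the following, assuming (day, code) event tuples and a hypothetical 180-day washout window; the paper's exact criteria are not specified in this review:

```python
# Sketch of incident-user (new-user) selection under standard assumptions:
# a patient qualifies if their first exposure to the drug is preceded by a
# drug-free washout window. The field layout and window length are
# hypothetical, not the paper's actual criteria.

WASHOUT_DAYS = 180  # hypothetical washout window

def first_exposure_day(events, drug):
    """Day of the first occurrence of `drug` among (day, code) events, or None."""
    days = [day for day, code in events if code == drug]
    return min(days) if days else None

def is_incident_user(events, drug, history_start_day):
    """True if a first exposure exists and follows a drug-free washout."""
    t0 = first_exposure_day(events, drug)
    if t0 is None:
        return False
    return t0 - history_start_day >= WASHOUT_DAYS

# Example: first steroid exposure on day 200, history from day 0,
# passes a 180-day washout.
patient = [(5, "GLUCOSE"), (200, "STEROID"), (201, "GLUCOSE")]
incident = is_incident_user(patient, "STEROID", history_start_day=0)
```

The premise under audit is that this filtering, combined with prompt shuffling, is enough to separate mechanism from confounding on purely observational records.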

What would settle it

Run the same model and ablation on data from randomized controlled trials for the tested drug-outcome pairs and check whether the 2.0-2.2x amplification for mechanistic effects disappears or persists.

read the original abstract

Autoregressive models can predict clinical events, but generating patient-conditioned multi-step trajectories that respond to intervention tokens and testing whether those responses preserve known pharmacological associations has received limited attention. We present FlatASCEND, a 14.5M-parameter autoregressive clinical sequence model using flat composite tokens and a zero-inflated log-normal time head. Standard distributional metrics (Jaccard 0.889-0.954) do not distinguish FlatASCEND from trivial baselines; the model's value lies in conditional generation from patient-specific prefixes. A prompt-shuffle ablation shows patient-specific conditioning amplifies mechanistic pharmacological effects (2.0-2.2x for steroid to glucose, diuretic to potassium) while leaving confounding-driven associations unchanged (0.9x for insulin to glucose). An incident-user framework assesses directional consistency against prior pharmacological knowledge on MIMIC-IV (N=500 per comparison): 4/10 recover correct mechanistic directions, 2 reproduce treatment-context associations, 4 are incorrect (9/10 significant, Wilcoxon p<0.05). This pattern - partial recovery under residual confounding - is consistent with learned observational associations without causal distinction. Direct preference optimisation with surrogate reward destroys all correct associations (3/3 to 0/3), illustrating reward exploitation when reward and evaluation share an outcome domain. Generative evidence is strongest for short-horizon ICU data; outpatient temporal fidelity is weaker (median 10 vs 154 days on INSPECT), and zero-shot cross-site transfer degrades without adaptation.
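The "zero-inflated log-normal time head" named in the abstract can be illustrated with a minimal likelihood sketch, assuming the usual parameterization: a point mass at zero (probability `p_zero`) for same-timestamp events, and a log-normal density for positive gaps. This is a generic formulation, not the paper's implementation:

```python
import math

def zi_lognormal_nll(dt, p_zero, mu, sigma):
    """Negative log-likelihood of one inter-event gap under the mixture:
    P(dt = 0) = p_zero; for dt > 0, density (1 - p_zero) * LogNormal(mu, sigma)."""
    if dt == 0.0:
        return -math.log(p_zero)
    # log-normal log-density: -(ln dt - mu)^2 / (2 sigma^2) - ln(dt sigma sqrt(2 pi))
    log_pdf = (
        -((math.log(dt) - mu) ** 2) / (2.0 * sigma ** 2)
        - math.log(dt * sigma * math.sqrt(2.0 * math.pi))
    )
    return -(math.log(1.0 - p_zero) + log_pdf)
```

The zero-inflation term matters for clinical sequences because many events share a timestamp (ordered and charted together), which a plain log-normal cannot represent.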

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FlatASCEND, a 14.5M-parameter autoregressive model for clinical sequence generation using flat composite tokens and a zero-inflated log-normal time head. It claims that while standard distributional metrics (Jaccard 0.889-0.954) fail to beat trivial baselines, the model's value is in patient-specific conditional generation; a prompt-shuffle ablation purportedly shows selective amplification of mechanistic pharmacological effects (2.0-2.2x for steroid-glucose and diuretic-potassium) versus no change for confounding associations (0.9x for insulin-glucose), and an incident-user design on MIMIC-IV recovers correct directions for 4/10 tested associations (9/10 significant by Wilcoxon), consistent with learned observational associations without causal distinction. DPO is shown to destroy correct associations.

Significance. If the prompt-shuffle ablation and incident-user design validly separate mechanistic responses from residual confounding and learned correlations, the work would offer a useful probe for how autoregressive clinical models encode conditional dependencies, with potential applications in trajectory simulation. The explicit acknowledgment that results reflect observational associations, the use of ablations and statistical tests, and the negative DPO result are strengths that demonstrate careful evaluation. However, the low association recovery rate and failure of unconditional metrics limit broader impact.

major comments (3)
  1. Prompt-shuffle ablation results: the claim of selective 2.0-2.2x amplification for mechanistic pairs (steroid to glucose, diuretic to potassium) versus 0.9x for the confounding pair (insulin to glucose) assumes the global shuffle isolates patient-specific conditioning without altering marginal distributions over time-varying confounders. Because the model is trained exclusively on observational MIMIC-IV trajectories, token transitions already entangle patient features, treatments, and outcomes; the differential effect may simply reflect stronger representation of mechanistic pairs in patient-specific marginals rather than true isolation of pharmacological mechanism. The paper itself concludes the pattern is 'consistent with learned observational associations without causal distinction,' making the selective-amplification framing load-bearing yet under-supported.
  2. Incident-user pharmacological testing framework (N=500 per comparison): only 4/10 associations recover the correct mechanistic direction, 4 are incorrect, and 2 reproduce treatment-context associations, despite 9/10 reaching Wilcoxon p<0.05. This low recovery rate, under the paper's own residual-confounding caveat, provides only weak evidence that the generative model captures pharmacological associations beyond training correlations; the framework's directional checks against external knowledge therefore do not strongly validate the model's mechanistic fidelity.
  3. Direct preference optimisation experiment: the finding that DPO reduces correct associations from 3/3 to 0/3 is presented as illustrating reward exploitation when reward and evaluation share an outcome domain. However, no additional controls (e.g., alternative rewards or out-of-domain evaluation) are reported to distinguish exploitation from general degradation of the generative distribution, weakening the illustrative claim.
minor comments (2)
  1. The abstract and results note that standard distributional metrics do not distinguish FlatASCEND from trivial baselines; explicit numerical comparisons to those baselines (e.g., unigram or Markov models) should be added to the main text or a table for clarity.
  2. Outpatient temporal fidelity is reported as weaker (median 10 vs 154 days on INSPECT) with degraded zero-shot cross-site transfer; a brief discussion of potential causes (e.g., data sparsity or tokenization) would help readers interpret the scope of the model's strengths.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important limitations in our interpretation of the prompt-shuffle ablation, the strength of evidence from the incident-user tests, and the controls in the DPO experiment. We address each point below with clarifications drawn directly from the manuscript's own caveats and propose targeted revisions to improve precision without overstating the results.

read point-by-point responses
  1. Referee: Prompt-shuffle ablation results: the claim of selective 2.0-2.2x amplification for mechanistic pairs (steroid to glucose, diuretic to potassium) versus 0.9x for the confounding pair (insulin to glucose) assumes the global shuffle isolates patient-specific conditioning without altering marginal distributions over time-varying confounders. ... making the selective-amplification framing load-bearing yet under-supported.

    Authors: We agree that the shuffle ablation cannot isolate causal pharmacological mechanisms, as the model is trained solely on observational MIMIC-IV data where token transitions already entangle patient features, treatments, and outcomes. The manuscript explicitly concludes that the observed pattern is 'consistent with learned observational associations without causal distinction.' The differential effect (amplification for the two mechanistic pairs, no change for the confounding pair) is presented only as evidence that patient-specific prefixes produce non-uniform shifts relative to shuffled prefixes, not as proof of mechanism isolation. To address the concern that the framing may be load-bearing, we will revise the relevant section and abstract to replace 'selective amplification of mechanistic pharmacological effects' with 'differential response to patient-specific conditioning for certain associations,' while retaining the quantitative results and the explicit non-causal caveat. This is a partial revision focused on language precision. revision: partial

  2. Referee: Incident-user pharmacological testing framework (N=500 per comparison): only 4/10 associations recover the correct mechanistic direction, 4 are incorrect, and 2 reproduce treatment-context associations, despite 9/10 reaching Wilcoxon p<0.05. This low recovery rate, under the paper's own residual-confounding caveat, provides only weak evidence that the generative model captures pharmacological associations beyond training correlations.

    Authors: We report the exact recovery statistics (4/10 correct mechanistic directions, 4 incorrect, 2 treatment-context) and the Wilcoxon significance (9/10) in the manuscript, and we frame the entire experiment as showing only 'partial recovery under residual confounding' that remains 'consistent with learned observational associations without causal distinction.' The test is not offered as strong validation of mechanistic fidelity but as a directional probe against external pharmacological knowledge on the same observational data source. The low correct-direction rate is therefore not an unacknowledged weakness but the central reported finding. We will add a sentence in the discussion explicitly noting that the mixed directional results limit claims about fidelity beyond correlations. This is a partial revision. revision: partial

  3. Referee: Direct preference optimisation experiment: the finding that DPO reduces correct associations from 3/3 to 0/3 is presented as illustrating reward exploitation when reward and evaluation share an outcome domain. However, no additional controls (e.g., alternative rewards or out-of-domain evaluation) are reported to distinguish exploitation from general degradation of the generative distribution, weakening the illustrative claim.

    Authors: The DPO result is included as a negative finding to illustrate the risk that a surrogate reward defined on the same outcome domain can eliminate previously observed correct associations. We acknowledge that, without controls such as alternative reward formulations or out-of-domain evaluation metrics, the experiment cannot rigorously separate targeted exploitation from broader degradation of the generative distribution. We will revise the text to describe the result as 'an illustrative case of association loss under in-domain reward optimization' and add a brief note that additional controls would be required to confirm exploitation as the mechanism. This is a partial revision. revision: partial
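The "surrogate reward" failure mode discussed above concerns the DPO objective itself. As context, here is a minimal sketch of the standard per-pair DPO loss of Rafailov et al. (plain log-probabilities in, scalar loss out); this is the textbook formulation, not the paper's training code:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).
    The policy is rewarded for raising the chosen continuation's log-ratio
    over the rejected one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At indifference (policy equals reference) the margin is 0 and the loss is log 2.
loss_at_indifference = dpo_loss(-1.0, -1.0, -1.0, -1.0)  # ≈ 0.693
```

Nothing in this objective references pharmacological correctness; when the preference signal is defined on the same outcome domain as the evaluation, the optimizer is free to reshape exactly the associations being measured, which is the exploitation risk at issue.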

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper reports empirical results from training an autoregressive model on MIMIC-IV trajectories and measuring differential amplification in a prompt-shuffle ablation plus directional consistency against external pharmacological knowledge. No equations, fitted parameters, or self-citations are presented that reduce the central claims (e.g., the 2.0-2.2x vs 0.9x contrast) to the training inputs by construction. The ablation compares conditioned versus shuffled prefixes within the same trained model; the resulting ratios are observed statistics on generated sequences rather than identities or renamed fits. The manuscript explicitly qualifies its findings as consistent with learned observational associations, and the evaluation chain is checked against external pharmacological benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim that patient-specific conditioning amplifies mechanistic effects rests on the domain assumption that prompt shuffling isolates patient context from general associations and that incident-user subsets allow directional comparison to prior pharmacological knowledge.

axioms (1)
  • domain assumption: Observational MIMIC-IV data with incident-user design permits assessment of directional consistency against prior pharmacological knowledge
    Invoked in the incident-user framework section of the abstract to interpret the 4/10 correct recoveries.

pith-pipeline@v0.9.0 · 5584 in / 1405 out tokens · 84479 ms · 2026-05-10T15:07:44.298828+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Y. Li, S. Rao, J. R. A. Solares, A. Hassaine, R. Ramakrishnan, D. Canoy, Y. Zhu, K. Rahimi, and G. Salimi-Khorshidi. BEHRT: Transformer for electronic health records. Scientific Reports, 10(1):1--12, 2020

  2. [2]

    L. Rasmy, Y. Xiang, Z. Xie, C. Tao, and D. Zhi. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Medicine, 4(1):1--13, 2021

  3. [3]

    L. L. Guo, E. Steinberg, S. L. Fleming, J. Posada, J. Lemmon, S. R. Pfohl, N. H. Shah, J. Fries, and L. Sung. EHR foundation models improve robustness in the presence of temporal distribution shift. arXiv preprint arXiv:2204.13992, 2022

  4. [4]

    E. Steinberg, J. Fries, Y. Xu, and N. H. Shah. MOTOR: A time-to-event foundation model for structured medical records. arXiv preprint arXiv:2301.03150, 2023

  5. [5]

    H. Rajamohan, Y. Yin, T. T. Zheng, et al. Scaling recurrence-aware foundation models for clinical records via next-visit prediction. arXiv preprint arXiv:2603.24562, 2026

  6. [6]

    Z. Chen, A. Pekis, and K. Brown. Building the EHR foundation model via next event prediction. arXiv preprint arXiv:2509.25591, 2025

  7. [7]

    O. Press, N. A. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022

  8. [8]

    R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2024

  9. [9]

    S. P. Marso, G. H. Daniels, K. Poulter, et al. Liraglutide and cardiovascular outcomes in type 2 diabetes. New England Journal of Medicine, 375(4):311--322, 2016

  10. [10]

    J. A. Russell, K. R. Walley, J. Singer, et al. Vasopressin versus norepinephrine infusion in patients with septic shock. New England Journal of Medicine, 358(9):877--887, 2008

  11. [11]

    C. P. Cannon, E. Braunwald, C. H. McCabe, et al. Intensive versus moderate lipid lowering with statins after acute coronary syndromes. New England Journal of Medicine, 350(15):1495--1504, 2004

  12. [12]

    P. A. Poole-Wilson, K. Swedberg, J. G. F. Cleland, et al. Comparison of carvedilol and metoprolol on clinical outcomes in patients with chronic heart failure in the COMET trial. The Lancet, 362(9377):7--13, 2003

  13. [13]

    J. D. Truwit, G. R. Bernard, J. Steingrub, et al. Rosuvastatin for sepsis-associated acute respiratory distress syndrome. New England Journal of Medicine, 370(23):2191--2200, 2014

  14. [14]

    A. E. W. Johnson, L. Bulgarelli, L. Shen, A. Gayles, A. Shammout, S. Horng, T. J. Pollard, B. Moody, B. Gow, L. Lehman, L. A. Celi, and R. G. Mark. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1--9, 2023

  15. [15]

    T. J. Pollard, A. E. W. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data, 5(1):1--13, 2018

  16. [16]

    A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C. K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215--e220, 2000

  17. [17]

    Z. Kraljevic, A. Bean, A. Shek, R. Bendayan, J. Teo, and R. Dobson. Foresight---generative pretrained transformer (GPT) for modelling of patient timelines using electronic health records. arXiv preprint arXiv:2212.08072, 2024

  18. [18]

    C. Sainsbury and A. Karwath. ASCENDgpt: a phenotype-aware transformer for cardiovascular risk prediction. arXiv preprint arXiv:2509.04485, 2025

  19. [19]

    M. A. Hernán and J. M. Robins. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology, 183(8):758--764, 2016

  20. [20]

    J. M. Robins, M. A. Hernán, and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550--560, 2000