pith. machine review for the scientific record.

arxiv: 2604.23054 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords clinical trial outcome prediction · counterfactual reasoning · biomedical language models · causal mechanism approximation · supervised fine-tuning · reinforcement learning with verifiable rewards · small language models · synthetic reasoning traces

The pith

DeepImagine trains small language models on successive counterfactual changes drawn from real clinical trials to improve outcome predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a training framework that teaches language models to reason about biomedical trials by repeatedly imagining how results would shift under isolated changes to attributes such as dosage, outcome measures, or geography. Real trial records supply both exact pairs for supervised fine-tuning and approximate pairs for reinforcement learning with rewards tied to prediction accuracy on downstream tasks. Synthetic reasoning traces are added to make the transitions causally plausible. The paper aims to show that the resulting models, all under 10 billion parameters, outperform both untuned language models and traditional correlational predictors at forecasting clinical trial results. A reader cares because accurate mechanistic prediction could reduce the high failure rate of prospective trials and support more efficient biomedical research.

Core claim

DeepImagine approximates hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions such as dosage, outcome measures, study arms, geography, and other trial attributes. Natural and approximate counterfactual pairs from real clinical trials support supervised fine-tuning for strict pairs and reinforcement learning with verifiable rewards for broader settings. Training is further augmented with synthetic reasoning traces that supply causally plausible explanations for local transitions. Models under 10B parameters, including Qwen3.5-9B, are evaluated with the aim of showing consistent improvements over untuned LLMs and traditional correlational baselines.
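In the simplest reading, the RL stage's verifiable reward reduces to downstream benchmark correctness. A minimal sketch, assuming a binary success/failure outcome label; the function name and exact matching rule are illustrative, not the paper's:

```python
def verifiable_reward(prediction: str, gold_label: str) -> float:
    """Binary verifiable reward tied to downstream benchmark correctness:
    1.0 if the predicted trial outcome matches the externally reported
    outcome, 0.0 otherwise. A production reward might add format checks
    or partial credit; none of that is specified here."""
    return 1.0 if prediction.strip().lower() == gold_label.strip().lower() else 0.0
```

Because the reward is computed from externally reported trial outcomes rather than a learned judge, it is verifiable in the RLVR sense.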

What carries the argument

Successive counterfactual imagining: training on pairs of real clinical trials that differ in one controlled attribute so the model learns to predict the resulting change in outcome and thereby approximates the underlying causal structure.
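The pair-construction step above can be sketched as follows; the attribute schema, record fields, and helper names are illustrative assumptions, not the paper's actual data model:

```python
from itertools import combinations

# Attributes that may be perturbed between paired trials (illustrative schema).
ATTRIBUTES = ["dosage", "outcome_measure", "study_arm", "geography"]

def counterfactual_pairs(trials):
    """Yield (a, b, changed_attr) for trial records that differ in exactly
    one controlled attribute; all other attributes must match."""
    for a, b in combinations(trials, 2):
        diffs = [k for k in ATTRIBUTES if a[k] != b[k]]
        if len(diffs) == 1:
            yield a, b, diffs[0]

trials = [
    {"dosage": "10mg", "outcome_measure": "ORR", "study_arm": "mono",
     "geography": "US", "outcome": "success"},
    {"dosage": "50mg", "outcome_measure": "ORR", "study_arm": "mono",
     "geography": "US", "outcome": "failure"},
    {"dosage": "10mg", "outcome_measure": "PFS", "study_arm": "combo",
     "geography": "EU", "outcome": "failure"},
]

# Only the first two records form a natural pair: they differ in dosage alone.
pairs = list(counterfactual_pairs(trials))
```

The strict single-difference filter is what distinguishes a counterfactual pair from a merely similar trial: the change in outcome can then be attributed, up to confounding, to the one perturbed attribute.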

Load-bearing premise

Pairs of real clinical trials that differ in only one attribute give models enough signal to recover true causal effects rather than spurious correlations.

What would settle it

No performance gain on held-out trials when the training pairs are replaced by random non-counterfactual matches, or when evaluation uses trials with attribute combinations absent from the counterfactual set.
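The first of those tests amounts to a pair-shuffling control. A minimal sketch, assuming the control pairs are fed to the same training pipeline (not shown); the function name and schema are hypothetical:

```python
import random

def random_control_pairs(trials, n_pairs, seed=0):
    """Build random trial pairs of the same volume as the counterfactual
    set, destroying the single-attribute structure while keeping data
    quantity fixed. If a model trained on these matches one trained on
    true counterfactual pairs, the pairing structure carried no signal."""
    rng = random.Random(seed)
    return [tuple(rng.sample(trials, 2)) for _ in range(n_pairs)]

trials = [{"id": i} for i in range(5)]
control = random_control_pairs(trials, 10)
```

Matching the pair count keeps the comparison about structure, not data volume.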

Figures

Figures reproduced from arXiv: 2604.23054 by Aditya K. Sehgal, Christopher D. Rosin, Hanyuan Zhang, Jianyou Wang, Longtian Bao, Matthew Feng, Maxim Khan, Ramamohan Paturi, Umber Dube, Youze Zheng, Yuhan Chen.

Figure 1. An example of Successive Counterfactual Imagination that allows a model to predict the … (view at source ↗)
Figure 2. Training loss curves under two perturbation datasets. (view at source ↗)
read the original abstract

Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DeepImagine, a framework for teaching LLMs biomedical reasoning via successive counterfactual imagining. It constructs natural and approximate counterfactual pairs from real clinical trial data (e.g., dose-ranging arms or pairs differing in dosage/geography), applies supervised fine-tuning for strict pairs and reinforcement learning with downstream benchmark rewards for approximate pairs, augments training with synthetic causally plausible reasoning traces, and aims to demonstrate consistent improvements over untuned LLMs and traditional correlational baselines on clinical trial outcome prediction while yielding interpretable signals about trial mechanisms.

Significance. If the empirical results were to confirm the claims, the work would provide a concrete method for moving LLMs beyond correlational prediction toward more mechanistic reasoning in biomedicine. The combination of real-data counterfactual pairs, verifiable RL rewards, and synthetic traces could support more reliable trial outcome forecasting and interpretable model behavior, addressing a known limitation of current LLMs on this task.

major comments (2)
  1. [Abstract] Abstract: The manuscript is framed entirely in terms of intended aims ('we aim to show') and describes a pipeline without reporting completed experiments, quantitative metrics, error bars, ablation results, or statistical comparisons. This absence means the central claim of consistent improvement on clinical trial outcome prediction lacks supporting evidence and cannot be evaluated.
  2. [Method] Method (counterfactual pair construction): The claim that pairs differing in one attribute approximate hidden causal mechanisms is load-bearing for the approach, yet no ablation or analysis addresses the risk that such pairs (even within-trial) embed unobserved confounders, selection effects, or spurious correlations. The RL objective directly optimizes for benchmark accuracy, which can be achieved by any predictive feature in the pairs rather than causal structure; no test isolating the single-attribute constraint is described.
minor comments (1)
  1. [Abstract] Abstract: Repeated use of 'we aim to show' makes the status of the work unclear; if full results exist, revise to report them directly rather than as prospective goals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The manuscript is framed entirely in terms of intended aims ('we aim to show') and describes a pipeline without reporting completed experiments, quantitative metrics, error bars, ablation results, or statistical comparisons. This absence means the central claim of consistent improvement on clinical trial outcome prediction lacks supporting evidence and cannot be evaluated.

    Authors: We agree that the abstract's prospective phrasing ('we aim to show') undercuts the presentation of completed work. The manuscript body describes training Qwen3.5-9B and other models under 10B parameters on the constructed pairs, followed by evaluation on clinical trial outcome prediction with comparisons to untuned LLMs and correlational baselines. To resolve the mismatch, we will revise the abstract to report the key quantitative results, including performance deltas, metrics used, and statistical comparisons, while removing the 'aim to' language. This change will make the abstract accurately reflect the experimental sections. revision: yes

  2. Referee: [Method] Method (counterfactual pair construction): The claim that pairs differing in one attribute approximate hidden causal mechanisms is load-bearing for the approach, yet no ablation or analysis addresses the risk that such pairs (even within-trial) embed unobserved confounders, selection effects, or spurious correlations. The RL objective directly optimizes for benchmark accuracy, which can be achieved by any predictive feature in the pairs rather than causal structure; no test isolating the single-attribute constraint is described.

    Authors: This is a valid methodological concern. Within-trial pairs (dose-ranging arms, paired outcome measures) reduce many confounders by construction, but we acknowledge that unobserved factors or selection effects could persist and that the RL reward on benchmark accuracy does not by itself guarantee causal structure. We will add (1) an ablation comparing single-attribute versus multi-attribute pairs to test the isolating effect of the constraint, (2) analysis of generated reasoning traces to check whether models reference the specific perturbed attribute when predicting outcomes, and (3) an explicit limitations paragraph discussing residual confounding risks and the correlational nature of the RL objective. These additions will directly address the load-bearing assumption. revision: yes
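The trace analysis promised in point (2) could be scripted roughly as below; the attribute lexicon, function names, and metric are illustrative assumptions, not the authors' protocol:

```python
import re

# Illustrative surface forms per perturbable attribute; a real audit would
# use richer lexicons or an annotator model.
ATTRIBUTE_TERMS = {
    "dosage": ["dose", "dosage", "mg"],
    "geography": ["region", "country", "geography"],
}

def trace_mentions_attribute(trace: str, changed_attr: str) -> bool:
    """True if the reasoning trace references the attribute that was
    actually perturbed in the counterfactual pair."""
    terms = ATTRIBUTE_TERMS.get(changed_attr, [changed_attr])
    text = trace.lower()
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms)

def grounding_rate(traces_and_attrs):
    """Fraction of traces that mention their own perturbed attribute."""
    hits = sum(trace_mentions_attribute(t, a) for t, a in traces_and_attrs)
    return hits / max(len(traces_and_attrs), 1)
```

A low grounding rate would suggest the model predicts outcome shifts from features other than the perturbed attribute, which is exactly the referee's confounding worry.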

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external data and benchmarks

full rationale

The paper constructs counterfactual pairs from real clinical trial data and trains via SFT or RL with rewards tied to downstream benchmark accuracy. No equations, self-definitions, or load-bearing self-citations are present that reduce the claimed improvement in causal reasoning to a quantity defined by the method itself. The approach uses external trial records and verifiable external benchmarks, keeping the central claim independent rather than tautological. Standard RL optimization for benchmark performance does not constitute a circular prediction under the defined patterns, as no fitted input is renamed as an independent result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that counterfactual pairs extracted from published trials can serve as proxies for causal mechanisms; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Counterfactual pairs differing in one trial attribute approximate the hidden causal mechanisms governing clinical outcomes.
    This premise justifies both the supervised fine-tuning and the reinforcement-learning stages.

pith-pipeline@v0.9.0 · 5622 in / 1306 out tokens · 39189 ms · 2026-05-08T11:26:37.180594+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 18 canonical work pages · 7 internal anchors

  1. [1]

W. Cao, J. Wang, Y. Zheng, L. Bao, Q. Zheng, T. Berg-Kirkpatrick, R. Paturi, and L. Bergen. Single-pass document scanning for question answering. arXiv preprint arXiv:2504.03101, 2025

  2. [2]

    M. Chen, X. Chen, and W.-t. Yih. Few-shot data synthesis for open domain multi-hop question answering. arXiv preprint arXiv:2305.13691, 2024

  3. [3]

Y. Chen, V. K. Singh, J. Ma, and R. Tang. Counterbench: Evaluating and improving counterfactual reasoning in large language models. arXiv preprint arXiv:2502.11008, 2025

  4. [4]

    T. Fu, K. Huang, C. Xiao, L. M. Glass, and J. Sun. Hint: Hierarchical interaction network for trial outcome prediction leveraging web data. arXiv preprint arXiv:2102.04252, 2021

  5. [5]

S. Gandhi, R. Gala, V. Viswanathan, T. Wu, and G. Neubig. Better synthetic data by retrieving and transforming existing datasets. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

  6. [6]

    C. Gao, J. Pradeepkumar, T. Das, S. Thati, and J. Sun. Automatically labeling clinical trial outcomes: A large-scale benchmark for drug development. arXiv preprint arXiv:2406.10292, 2024

  7. [7]

    T. M. Greenhalgh and P. Dijkstra. How to Read a Paper: The Basics of Evidence-Based Healthcare. Wiley-Blackwell, 7 edition, 2024

  8. [8]

Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 2021

  9. [9]

    Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), 2019

  10. [10]

Z. Jin, Y. Chen, F. Leeb, L. Gresele, O. Kamal, Z. Lyu, K. Blin, F. Gonzalez Adauto, M. Kleiman-Weiner, M. Sachan, and B. Schölkopf. Cladder: Assessing causal reasoning in language models. In Advances in Neural Information Processing Systems, 2023

  11. [11]

E. Kıcıman, R. Ness, A. Sharma, and C. Tan. Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research, 2023. arXiv:2305.00050

  12. [12]

A. Krithara, A. Nentidis, K. Bougiatiotis, and G. Paliouras. Bioasq-qa: A manually curated corpus for biomedical question answering. Scientific Data, 10(170), 2023

  13. [13]

C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping. Nv-embed: Improved techniques for training llms as generalist embedding models, 2025. URL https://arxiv.org/abs/2405.17428

  14. [14]

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023

  15. [15]

M. Miliani, S. Auriemma, A. Bondielli, E. Chersoni, L. C. Passaro, I. Sucameli, and A. Lenci. Explica: Evaluating explicit causal reasoning in large language models. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

  16. [16]

OpenAI. OpenAI o3-mini System Card. https://cdn.openai.com/o3-mini-system-card-feb10.pdf, Jan. 2025

  17. [17]

OpenAI. GPT-5.4 Thinking System Card. https://deploymentsafety.openai.com/gpt-5-4-thinking, Mar. 2026

  18. [18]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  19. [19]

    J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2 edition, 2009

  20. [20]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  21. [21]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  22. [22]

N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020

  23. [23]

Q. Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  24. [24]

A. Vashishtha et al. Executable counterfactuals: Improving LLMs’ causal reasoning through code. arXiv preprint arXiv:2510.01539, 2025

  25. [25]

J. Wang, K. Wang, X. Wang, W. Cao, R. Paturi, and L. Bergen. IR2: Information regularization for information retrieval. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9261–9284, Torino, Italia, 2024. ELRA and ICCL

  26. [26]

J. Wang, W. Cao, L. Bao, Y. Zheng, G. Pasternak, K. Wang, X. Wang, R. Paturi, and L. Bergen. Measuring risk of bias in biomedical reports: The robbr benchmark. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  27. [27]

    J. Wang, W. Cao, K. Wang, X. Wang, A. Dalvi, G. Prasad, Q. Liang, H.-l. Her, M. Wang, Q. Yang, G. W. Yeo, D. E. Neal, M. Khan, C. D. Rosin, R. Paturi, and L. Bergen. Evidencebench: A benchmark for extracting evidence from biomedical papers. arXiv preprint arXiv:2504.18736, 2025

  28. [28]

J. Wang, Y. Zheng, L. Bao, H. Zhang, Q. Zheng, Y. Chen, Y. Zhang, M. Feng, M. Khan, A. K. Sehgal, C. D. Rosin, R. Paturi, U. Dube, and L. Bergen. Ct open: An open-access, uncontaminated, live platform for the open challenge of clinical trial outcome prediction, 2026. URL https://arxiv.org/abs/2604.16742

  29. [29]

Z. Wang and J. Sun. Trial2vec: Zero-shot clinical trial document similarity search using self-supervision. arXiv preprint arXiv:2206.14719, 2022

  30. [30]

    Z. Wang, B. Theodorou, T. Fu, C. Xiao, and J. Sun. Pytrial: Machine learning software and benchmark for clinical trial applications. arXiv preprint arXiv:2306.04018, 2023

  31. [31]

    Z. Wang, C. Xiao, and J. Sun. Spot: Sequential predictive modeling of clinical trial outcome with meta-learning. arXiv preprint arXiv:2304.05352, 2023

  32. [32]

A. Yehudai, B. Carmeli, Y. Mass, O. Arviv, N. Mills, A. Toledo, E. Shnarch, and L. Choshen. Genie: Achieving human parity in content-grounded datasets generation. arXiv preprint arXiv:2401.14367, 2024