DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
Pith reviewed 2026-05-08 11:26 UTC · model grok-4.3
The pith
DeepImagine trains small language models on successive counterfactual changes drawn from real clinical trials to improve outcome predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepImagine approximates hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions such as dosage, outcome measures, study arms, geography, and other trial attributes. Natural and approximate counterfactual pairs constructed from real clinical trials drive the training: supervised fine-tuning where strict pairs exist, and reinforcement learning with verifiable rewards in broader settings. Training is further augmented with synthetic reasoning traces that supply causally plausible explanations for local transitions. Models under 10B parameters, including Qwen3.5-9B, are said to achieve consistent improvements over untuned LLMs and traditional correlational baselines on clinical trial outcome prediction.
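To make the training signal concrete, here is a minimal sketch of how a strict counterfactual pair might be represented and turned into a supervised example. The paper does not publish its data schema, so the record fields, the single-attribute pairing rule, and the prompt format below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; field names and the pairing rule are assumptions.
from dataclasses import dataclass

@dataclass
class TrialRecord:
    trial_id: str
    attributes: dict  # e.g. {"dosage": "10mg", "geography": "EU", "arm": "A"}
    outcome: str      # reported result, e.g. "primary endpoint met"

def is_counterfactual_pair(a: TrialRecord, b: TrialRecord) -> bool:
    """A strict pair differs in exactly one controlled attribute."""
    diffs = [k for k in a.attributes if a.attributes[k] != b.attributes.get(k)]
    return len(diffs) == 1

def to_sft_example(a: TrialRecord, b: TrialRecord) -> dict:
    """Format a strict pair as a supervised example: given trial A and the
    single perturbed attribute, predict trial B's outcome.
    Assumes is_counterfactual_pair(a, b) already holds."""
    (changed,) = [k for k in a.attributes if a.attributes[k] != b.attributes[k]]
    prompt = (
        f"Trial {a.trial_id} with attributes {a.attributes} reported: {a.outcome}.\n"
        f"If {changed} were changed to {b.attributes[changed]!r}, "
        f"what outcome would be observed?"
    )
    return {"prompt": prompt, "target": b.outcome}
```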
What carries the argument
Successive counterfactual imagining: training on pairs of real clinical trials that differ in one controlled attribute so the model learns to predict the resulting change in outcome and thereby approximates the underlying causal structure.
Load-bearing premise
Pairs of real clinical trials that differ in only one attribute give models enough signal to recover true causal effects rather than spurious correlations.
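One way to make this premise precise, in potential-outcomes notation that is ours rather than the paper's: write Y_i(a) for trial i's outcome when attribute A is set to value a, and X for the remaining observed attributes.

```latex
% Our formalization of the load-bearing premise; notation is not from the paper.
\[
  \Delta_{ij} \;=\; Y_j(a') - Y_i(a)
  \qquad \text{(the observed pair difference)}
\]
% recovers the causal effect $\mathbb{E}[Y(a') - Y(a)]$ only if
\[
  X_i = X_j
  \quad\text{and}\quad
  \big(Y(a),\, Y(a')\big) \perp\!\!\!\perp A \,\big|\, X ,
\]
% i.e. the pair is matched on every other attribute and no unobserved
% confounder differs across the two trials. Matching on observed attributes
% alone does not guarantee the second condition, which is exactly the
% referee's objection below.
```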
What would settle it
No performance gain on held-out trials when the training pairs are replaced by random non-counterfactual matches or when evaluation uses trials with attribute combinations absent from the counterfactual set.
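Continuing the TrialRecord sketch above, the control arm of that test could be built as follows; the rejection rule and the function shape are our assumptions about what "random non-counterfactual matches" would mean in practice.

```python
import random

def random_control_pairs(trials, n_pairs, seed=0):
    """Sample pairs matched in count to the counterfactual set but violating
    the single-attribute constraint, as a non-counterfactual control."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        a, b = rng.sample(trials, 2)
        diffs = [k for k in a.attributes if a.attributes[k] != b.attributes.get(k)]
        if len(diffs) != 1:  # reject accidental strict counterfactual pairs
            pairs.append((a, b))
    return pairs
```

If a model trained on these random matches performs as well on held-out trials as one trained on true counterfactual pairs, the causal-mechanism reading of the gains would not survive.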
Original abstract
Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DeepImagine, a framework for teaching LLMs biomedical reasoning via successive counterfactual imagining. It constructs natural and approximate counterfactual pairs from real clinical trial data (e.g., dose-ranging arms or pairs differing in dosage/geography), applies supervised fine-tuning for strict pairs and reinforcement learning with downstream benchmark rewards for approximate pairs, augments training with synthetic causally plausible reasoning traces, and aims to demonstrate consistent improvements over untuned LLMs and traditional correlational baselines on clinical trial outcome prediction while yielding interpretable signals about trial mechanisms.
Significance. If the empirical results were to confirm the claims, the work would provide a concrete method for moving LLMs beyond correlational prediction toward more mechanistic reasoning in biomedicine. The combination of real-data counterfactual pairs, verifiable RL rewards, and synthetic traces could support more reliable trial outcome forecasting and interpretable model behavior, addressing a known limitation of current LLMs on this task.
Major comments (2)
- [Abstract] The manuscript is framed entirely in terms of intended aims ('we aim to show') and describes a pipeline without reporting completed experiments, quantitative metrics, error bars, ablation results, or statistical comparisons. This absence means the central claim of consistent improvement on clinical trial outcome prediction lacks supporting evidence and cannot be evaluated.
- [Method] Counterfactual pair construction: The claim that pairs differing in one attribute approximate hidden causal mechanisms is load-bearing for the approach, yet no ablation or analysis addresses the risk that such pairs (even within-trial) embed unobserved confounders, selection effects, or spurious correlations. The RL objective directly optimizes for benchmark accuracy, which can be achieved by any predictive feature in the pairs rather than causal structure; no test isolating the single-attribute constraint is described (see the sketch below).
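The RL objective the referee is questioning can be pinned down with a one-function sketch; the exact-match check and argument names are our assumptions about what "verifiable rewards based on downstream benchmark correctness" means in practice, not code from the paper.

```python
def verifiable_reward(model_answer: str, gold_label: str) -> float:
    """Binary verifiable reward: 1.0 if the predicted trial outcome matches
    the verified benchmark label, else 0.0. As the referee notes, this reward
    is agnostic to *how* the answer was produced, so any predictive feature,
    causal or spurious, can earn it."""
    return 1.0 if model_answer.strip().lower() == gold_label.strip().lower() else 0.0
```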
Minor comments (1)
- [Abstract] Repeated use of 'we aim to show' makes the status of the work unclear; if full results exist, revise to report them directly rather than as prospective goals.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The manuscript is framed entirely in terms of intended aims ('we aim to show') and describes a pipeline without reporting completed experiments, quantitative metrics, error bars, ablation results, or statistical comparisons. This absence means the central claim of consistent improvement on clinical trial outcome prediction lacks supporting evidence and cannot be evaluated.
  Authors: We agree that the abstract's prospective phrasing ('we aim to show') undercuts the presentation of completed work. The manuscript body describes training Qwen3.5-9B and other models under 10B parameters on the constructed pairs, followed by evaluation on clinical trial outcome prediction with comparisons to untuned LLMs and correlational baselines. To resolve the mismatch, we will revise the abstract to report the key quantitative results, including performance deltas, metrics used, and statistical comparisons, while removing the 'aim to' language. This change will make the abstract accurately reflect the experimental sections. Revision: yes.
- Referee: [Method] Counterfactual pair construction: The claim that pairs differing in one attribute approximate hidden causal mechanisms is load-bearing for the approach, yet no ablation or analysis addresses the risk that such pairs (even within-trial) embed unobserved confounders, selection effects, or spurious correlations. The RL objective directly optimizes for benchmark accuracy, which can be achieved by any predictive feature in the pairs rather than causal structure; no test isolating the single-attribute constraint is described.
  Authors: This is a valid methodological concern. Within-trial pairs (dose-ranging arms, paired outcome measures) reduce many confounders by construction, but we acknowledge that unobserved factors or selection effects could persist and that the RL reward on benchmark accuracy does not by itself guarantee causal structure. We will add (1) an ablation comparing single-attribute versus multi-attribute pairs to test the isolating effect of the constraint, (2) analysis of generated reasoning traces to check whether models reference the specific perturbed attribute when predicting outcomes, and (3) an explicit limitations paragraph discussing residual confounding risks and the correlational nature of the RL objective. These additions will directly address the load-bearing assumption. Revision: yes.
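A minimal version of the trace analysis proposed in item (2) could be run as a keyword-grounding check; the heuristic below is our assumption about one way to operationalize "whether models reference the specific perturbed attribute", not a described implementation.

```python
import re

def trace_mentions_attribute(trace: str, attribute: str, new_value: str) -> bool:
    """True if a reasoning trace explicitly names the perturbed attribute
    or its counterfactual value."""
    pattern = re.compile(
        rf"\b({re.escape(attribute)}|{re.escape(new_value)})\b", re.IGNORECASE
    )
    return bool(pattern.search(trace))

def attribute_grounding_rate(traces, perturbations):
    """Fraction of traces grounded in the single perturbed attribute;
    a low rate suggests the model is exploiting other correlates."""
    hits = sum(
        trace_mentions_attribute(t, attr, val)
        for t, (attr, val) in zip(traces, perturbations)
    )
    return hits / max(len(traces), 1)
```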
Circularity Check
No significant circularity; derivation relies on external data and benchmarks
Full rationale
The paper constructs counterfactual pairs from real clinical trial data and trains via SFT or RL with rewards tied to downstream benchmark accuracy. No equations, self-definitions, or load-bearing self-citations are present that reduce the claimed improvement in causal reasoning to a quantity defined by the method itself. The approach uses external trial records and verifiable external benchmarks, keeping the central claim independent rather than tautological. Standard RL optimization for benchmark performance does not constitute a circular prediction under the defined patterns, as no fitted input is renamed as an independent result.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Counterfactual pairs differing in one trial attribute approximate the hidden causal mechanisms governing clinical outcomes.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] Y. Chen, V. K. Singh, J. Ma, and R. Tang. Counterbench: Evaluating and improving counterfactual reasoning in large language models. arXiv preprint arXiv:2502.11008, 2025.
- [4]
- [5] S. Gandhi, R. Gala, V. Viswanathan, T. Wu, and G. Neubig. Better synthetic data by retrieving and transforming existing datasets. In Findings of the Association for Computational Linguistics: ACL 2024, 2024.
- [6]
- [7] T. M. Greenhalgh and P. Dijkstra. How to Read a Paper: The Basics of Evidence-Based Healthcare. Wiley-Blackwell, 7th edition, 2024.
- [8] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 2021.
- [9] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), 2019.
- [10] Z. Jin, Y. Chen, F. Leeb, L. Gresele, O. Kamal, Z. Lyu, K. Blin, F. Gonzalez Adauto, M. Kleiman-Weiner, M. Sachan, and B. Schölkopf. CLadder: Assessing causal reasoning in language models. In Advances in Neural Information Processing Systems, 2023.
- [11] E. Kıcıman, R. Ness, A. Sharma, and C. Tan. Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research, 2023. arXiv:2305.00050.
- [12] A. Krithara, A. Nentidis, K. Bougiatiotis, and G. Paliouras. BioASQ-QA: A manually curated corpus for biomedical question answering. Scientific Data, 10(170), 2023.
- [13] C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models, 2025. URL https://arxiv.org/abs/2405.17428.
- [14] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
- [15] M. Miliani, S. Auriemma, A. Bondielli, E. Chersoni, L. C. Passaro, I. Sucameli, and A. Lenci. ExpliCa: Evaluating explicit causal reasoning in large language models. In Findings of the Association for Computational Linguistics: ACL 2025, 2025.
- [16] OpenAI. OpenAI o3-mini System Card. https://cdn.openai.com/o3-mini-system-card-feb10.pdf, Jan. 2025.
- [17] OpenAI. GPT-5.4 Thinking System Card. https://deploymentsafety.openai.com/gpt-5-4-thinking, Mar. 2026.
- [18] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- [19] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
- [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [21] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [22] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020.
- [23] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
- [24] A. Vashishtha et al. Executable counterfactuals: Improving LLMs' causal reasoning through code. arXiv preprint arXiv:2510.01539, 2025.
- [25] J. Wang, K. Wang, X. Wang, W. Cao, R. Paturi, and L. Bergen. IR2: Information regularization for information retrieval. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9261–9284, Torino, Italia, 2024. ELRA and ICCL.
- [26] J. Wang, W. Cao, L. Bao, Y. Zheng, G. Pasternak, K. Wang, X. Wang, R. Paturi, and L. Bergen. Measuring risk of bias in biomedical reports: The RoBBR benchmark. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [27]
- [28] J. Wang, Y. Zheng, L. Bao, H. Zhang, Q. Zheng, Y. Chen, Y. Zhang, M. Feng, M. Khan, A. K. Sehgal, C. D. Rosin, R. Paturi, U. Dube, and L. Bergen. CT Open: An open-access, uncontaminated, live platform for the open challenge of clinical trial outcome prediction, 2026. URL https://arxiv.org/abs/2604.16742.
- [29] Z. Wang and J. Sun. Trial2Vec: Zero-shot clinical trial document similarity search using self-supervision. arXiv preprint arXiv:2206.14719, 2022.
- [30]
- [31]
- [32] A. Yehudai, B. Carmeli, Y. Mass, O. Arviv, N. Mills, A. Toledo, E. Shnarch, and L. Choshen. Genie: Achieving human parity in content-grounded datasets generation. arXiv preprint arXiv:2401.14367, 2024.