Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records
Pith reviewed 2026-05-19 02:15 UTC · model grok-4.3
The pith
Expert annotations during training enable valid treatment benefit predictions from multimodal EHR representations alone at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AACE (Annotation-Assisted Coarsened Effects) uses expert-provided annotations during training to support confounding adjustment in multimodal electronic health records, allowing the model to predict treatment benefits accurately from multimodal representations without annotations at inference time.
What carries the argument
Annotation-Assisted Coarsened Effects (AACE), which leverages expert annotations at training to adjust for confounding while learning predictors that operate solely on multimodal representations at test time.
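The two-stage idea can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' implementation: it assumes a single binary annotation that captures all confounding, estimates nuisance functions as plain stratum means within annotation levels, forms doubly robust pseudo-outcomes, and then fits a line from a one-dimensional stand-in for the multimodal representation to those pseudo-outcomes. The data-generating process, the function names (`simulate`, `dr_pseudo_outcomes`, `fit_line`, `train`), and the linear second stage are all hypothetical; the paper's coarsened-effects estimator may differ.

```python
import random
from statistics import mean


def simulate(n=4000, seed=0):
    """Toy data (hypothetical): annotation a confounds treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = 1 if rng.random() < 0.5 else 0            # expert annotation, training only
        x = a + rng.gauss(0.0, 0.5)                   # stand-in for a multimodal representation
        t = 1 if rng.random() < (0.8 if a else 0.2) else 0  # treatment depends on a
        tau = 1.0 + 2.0 * a                           # true treatment effect
        y = 2.0 * a + tau * t + rng.gauss(0.0, 1.0)   # observed outcome
        rows.append((a, x, t, y))
    return rows


def dr_pseudo_outcomes(rows):
    """Doubly robust pseudo-outcomes with nuisances estimated per annotation level.

    Assumption: no cross-fitting, to keep the sketch short.
    """
    nuis = {}
    for level in {a for a, _, _, _ in rows}:
        grp = [(t, y) for a, _, t, y in rows if a == level]
        e = mean(t for t, _ in grp)                   # propensity within the stratum
        mu1 = mean(y for t, y in grp if t == 1)       # mean outcome among treated
        mu0 = mean(y for t, y in grp if t == 0)       # mean outcome among untreated
        nuis[level] = (e, mu1, mu0)
    out = []
    for a, _, t, y in rows:
        e, mu1, mu0 = nuis[a]
        out.append(mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e))
    return out


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def train(rows):
    """Training uses annotations; the returned predictor needs only x."""
    phi = dr_pseudo_outcomes(rows)
    slope, intercept = fit_line([x for _, x, _, _ in rows], phi)
    return lambda x: slope * x + intercept
```

Calling `predict = train(simulate())` yields a function of the representation alone, which mirrors the train-with-annotations, infer-without-annotations split described above. In this toy setup the predicted benefit should rise with `x`, because `x` is a noisy proxy for the annotation that drives the true effect.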
If this is right
- Treatment policies can be deployed using only multimodal EHR data at the point of care without requiring ongoing expert annotations.
- The learned policies identify patients with the largest expected treatment benefit rather than simply those at highest baseline risk.
- Performance gains appear across synthetic, semi-synthetic, and real-world EHR datasets compared with risk-based and representation-based causal baselines.
- Annotations are needed only during model development, lowering the barrier to applying causal methods in multimodal clinical data.
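The difference between ranking by benefit and ranking by risk can be made concrete with a toy example. The numbers below are invented purely for illustration and do not come from the paper; `top_k` and `total_benefit` are hypothetical helper names.

```python
def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical patients: baseline risk and expected treatment benefit.
baseline_risk = [0.90, 0.80, 0.50, 0.40, 0.20]
expected_benefit = [0.02, 0.05, 0.30, 0.25, 0.10]


def total_benefit(selected):
    """Sum of expected benefit over the selected patient indices."""
    return sum(expected_benefit[i] for i in selected)


by_risk = top_k(baseline_risk, k=2)        # treats the two highest-risk patients
by_benefit = top_k(expected_benefit, k=2)  # treats the two who gain most
```

With a budget of two treatments, the risk-based rule selects patients who gain little, while the benefit-based rule selects a different pair with a much larger total expected benefit. The gap only appears when risk and benefit are not aligned, as in this constructed example.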
Where Pith is reading between the lines
- The same training-time annotation strategy could extend to other multimodal domains where rich sensor or text data exist but expert labels are expensive to maintain at scale.
- Minimizing the annotation budget while preserving adjustment quality would be a direct next test of practicality.
- Domain-specific annotation protocols might be required when the confounders in EHR text differ from those in tabular fields.
Load-bearing premise
Expert annotations collected only in training capture the confounding information needed so that multimodal representations alone produce valid treatment benefit estimates at inference.
What would settle it
Observe patient outcomes under policies derived from AACE versus baselines in a prospective clinical setting; if the method shows no reduction in bias or no improvement in benefit identification when annotations are absent at deployment, the central claim fails.
Original abstract
We study how to learn treatment policies from multimodal electronic health records (EHRs) that consist of tabular data and clinical text. These policies can help physicians make better treatment decisions and allocate healthcare resources more efficiently. Causal policy learning methods prioritize patients with the largest expected treatment benefit. Yet, existing estimators are designed for tabular covariates under causal assumptions that may be hard to justify in the multimodal setting. A pragmatic alternative is to apply causal estimators directly to multimodal representations, but this can produce biased treatment effect estimates when the representations do not preserve the relevant confounding information. As a result, predictive models of baseline risk are commonly used in practice to guide treatment decisions, although they are not designed to identify which patients benefit most from treatment. We propose AACE (Annotation-Assisted Coarsened Effects), an annotation-assisted approach to causal policy learning for multimodal EHRs. The method uses expert-provided annotations during training to support confounding adjustment, and then predicts treatment benefit from only multimodal representations at inference. We show that the proposed method achieves strong empirical performance across synthetic, semi-synthetic, and real-world EHR datasets, outperforming risk-based and representation-based causal baselines, and offering practical insights for applying causal machine learning in clinical practice.
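The bias that arises when a representation fails to preserve confounding information can be shown with a small simulation. The sketch below is an assumption-laden toy, not an experiment from the paper: a binary confounder `c` drives both treatment and outcome, the true treatment effect is fixed at 1.0, and "dropping" the confounder is modeled by simply ignoring it. All names and parameter values are hypothetical.

```python
import random
from statistics import mean


def simulate(n=4000, seed=1):
    """Toy observational data with one binary confounder c (hypothetical)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.8 if c else 0.2) else 0   # c raises treatment odds
        y = 1.0 * t + 2.0 * c + rng.gauss(0.0, 1.0)          # true effect is 1.0
        data.append((c, t, y))
    return data


def naive_effect(data):
    """Difference in means that ignores c, as if the representation had lost it."""
    treated = mean(y for _, t, y in data if t == 1)
    control = mean(y for _, t, y in data if t == 0)
    return treated - control


def adjusted_effect(data):
    """Stratify on c, then average stratum effects weighted by stratum size."""
    total = 0.0
    for level in (0, 1):
        grp = [(t, y) for c, t, y in data if c == level]
        diff = mean(y for t, y in grp if t == 1) - mean(y for t, y in grp if t == 0)
        total += diff * len(grp) / len(data)
    return total
```

In this setup the naive estimate lands well above the true effect because treated patients are more likely to have `c = 1`, while the stratified estimate stays close to 1.0. Real multimodal representations lose confounding information in subtler ways, so this only illustrates the direction of the problem.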
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AACE (Annotation-Assisted Coarsened Effects), a method for learning treatment policies from multimodal EHRs (tabular data plus clinical text). Expert annotations are used only during training to support coarsened-effect confounding adjustment; at inference the model predicts treatment benefit from multimodal representations alone. The authors report that AACE outperforms risk-based and representation-based causal baselines across synthetic, semi-synthetic, and real-world datasets.
Significance. If the central empirical claims hold, the work supplies a pragmatic route to causal policy learning in multimodal clinical data where annotations are costly at deployment. The multi-regime evaluation (synthetic through real EHR) is a concrete strength that allows direct comparison of policy value estimates.
major comments (2)
- [§3.2] §3.2 (Method): the claim that the learned multimodal encoder necessarily recovers the annotation-adjusted treatment-benefit ordering at inference is asserted without a supporting bound, sensitivity result, or identifiability argument. Because the reported policy-value gains rest on this preservation property, the absence of such analysis makes the validity of the annotation-free inference step load-bearing for the central claim.
- [§4.3] §4.3 (Real-world EHR experiments): the performance tables report point estimates of policy value but do not include error bars, bootstrap intervals, or sensitivity checks to annotation quality or residual confounding; without these, it is impossible to assess whether the observed gains over representation-based baselines are robust or dataset-specific.
minor comments (2)
- [§3] Notation for the coarsened effect estimator and the multimodal encoder should be unified across §3.1 and §3.2 to avoid ambiguity in how the annotation signal is injected during training.
- [Figure 2] Figure 2 (synthetic data results) would benefit from explicit labeling of the x-axis as the degree of confounding strength rather than an opaque index.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
- Referee: [§3.2] §3.2 (Method): the claim that the learned multimodal encoder necessarily recovers the annotation-adjusted treatment-benefit ordering at inference is asserted without a supporting bound, sensitivity result, or identifiability argument. Because the reported policy-value gains rest on this preservation property, the absence of such analysis makes the validity of the annotation-free inference step load-bearing for the central claim.
  Authors: We appreciate the referee's observation that a formal bound or identifiability argument would strengthen the presentation. The AACE training objective explicitly incorporates expert annotations to produce coarsened-effect-adjusted targets; the multimodal encoder is then optimized to predict these targets from the available modalities. This design choice is intended to align the learned representations with the annotation-adjusted ordering. While the original submission emphasizes empirical validation across synthetic, semi-synthetic, and real regimes rather than a complete theoretical guarantee, we will add a dedicated paragraph in §3.2 discussing the preservation property under the coarsened-effects framework and include a sensitivity analysis that varies annotation quality and measures resulting changes in policy value. A full identifiability proof remains an open theoretical question.
  revision: partial
- Referee: [§4.3] §4.3 (Real-world EHR experiments): the performance tables report point estimates of policy value but do not include error bars, bootstrap intervals, or sensitivity checks to annotation quality or residual confounding; without these, it is impossible to assess whether the observed gains over representation-based baselines are robust or dataset-specific.
  Authors: We agree that uncertainty quantification and sensitivity checks are necessary to evaluate robustness. In the revised manuscript we will replace the point estimates in the real-world tables of §4.3 with bootstrap confidence intervals (1,000 resamples) and add two new sensitivity panels: one varying the fraction and quality of annotations used during training, and one examining performance under increasing levels of simulated residual confounding. These additions will allow readers to assess whether the reported gains are stable across plausible annotation and confounding regimes.
  revision: yes
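A bootstrap interval of the kind promised in the rebuttal could look roughly like the following. This is a generic percentile bootstrap over per-patient contributions to a policy value estimate, written as a sketch; how the authors define those contributions or which bootstrap variant they will use is not stated, so the function name `bootstrap_ci`, its defaults, and the input format are assumptions.

```python
import random


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    `values` is assumed to hold one policy-value contribution per patient.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample patients with replacement and record the mean of each resample.
    stats = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    k = round(alpha / 2 * n_resamples)  # resamples trimmed from each tail
    return stats[k], stats[n_resamples - 1 - k]
```

Reporting `bootstrap_ci(contributions)` next to each point estimate would let a reader check whether the intervals for AACE and a baseline overlap. A comparison on paired differences between methods would usually be more informative than two separate intervals, though that goes beyond what the rebuttal commits to.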
Circularity Check
No significant circularity; derivation relies on external annotations and empirical validation rather than self-referential definitions or fits
The paper proposes the AACE method, which incorporates expert annotations only during training to aid confounding adjustment on multimodal EHR inputs and then drops them at inference to predict from learned representations. No load-bearing derivation step reduces by construction to its own inputs, fitted parameters, or self-citation chains; the central claim of improved policy learning is supported by explicit experiments across synthetic, semi-synthetic, and real-world datasets rather than being defined into the result. The method is self-contained against external benchmarks, with validity treated as an empirical question.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert annotations during training can support valid confounding adjustment for multimodal data.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We propose AACE (Annotation-Assisted Coarsened Effects), an annotation-assisted approach to causal policy learning for multimodal EHRs. The method uses expert-provided annotations during training to support confounding adjustment, and then predicts treatment benefit from only multimodal representations at inference.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The doubly robust learner... nuisance functions... pseudo-outcome Δx_i
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.