pith. sign in

arxiv: 2507.20993 · v3 · submitted 2025-07-28 · 💻 cs.LG · cs.AI· stat.ML

Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records

Pith reviewed 2026-05-19 02:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML
keywords causal policy learningmultimodal EHRtreatment policiesconfounding adjustmentannotation-assisted learningclinical decision supportelectronic health records
0
0 comments X

The pith

Expert annotations during training enable valid treatment benefit predictions from multimodal EHR representations alone at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to learn treatment policies from multimodal electronic health records that combine tabular data and clinical text. Standard causal estimators risk bias when applied directly to learned representations because those representations may omit key confounding factors. The approach incorporates expert annotations only during training to perform confounding adjustment, then trains a predictor that outputs treatment benefit estimates from multimodal representations at inference time without any annotations. This setup matters because it allows causal policy learning to work in real clinical settings where annotations are costly to obtain continuously while still respecting the need to identify patients who benefit most from treatment rather than just high-risk ones. Empirical results on synthetic, semi-synthetic, and real EHR data show the method outperforms both risk-based baselines and direct representation-based causal estimators.

Core claim

AACE (Annotation-Assisted Coarsened Effects) uses expert-provided annotations during training to support confounding adjustment in multimodal electronic health records, allowing the model to predict treatment benefits accurately from multimodal representations without annotations at inference time.

What carries the argument

Annotation-Assisted Coarsened Effects (AACE), which leverages expert annotations at training to adjust for confounding while learning predictors that operate solely on multimodal representations at test time.

If this is right

  • Treatment policies can be deployed using only multimodal EHR data at the point of care without requiring ongoing expert annotations.
  • The learned policies identify patients with the largest expected treatment benefit rather than simply those at highest baseline risk.
  • Performance gains appear across synthetic, semi-synthetic, and real-world EHR datasets compared with risk-based and representation-based causal baselines.
  • Annotations are needed only during model development, lowering the barrier to applying causal methods in multimodal clinical data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training-time annotation strategy could extend to other multimodal domains where rich sensor or text data exist but expert labels are expensive to maintain at scale.
  • Minimizing the annotation budget while preserving adjustment quality would be a direct next test of practicality.
  • Domain-specific annotation protocols might be required when the confounders in EHR text differ from those in tabular fields.

Load-bearing premise

Expert annotations collected only in training capture the confounding information needed so that multimodal representations alone produce valid treatment benefit estimates at inference.

What would settle it

Observe patient outcomes under policies derived from AACE versus baselines in a prospective clinical setting; if the method shows no reduction in bias or no improvement in benefit identification when annotations are absent at deployment, the central claim fails.

Figures

Figures reproduced from arXiv: 2507.20993 by Henri Arno, Thomas Demeester.

Figure 1
Figure 1. Figure 1: Overview of the plug-in method for estimating treatment effects from unstructured data. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of methods for estimating τ ϕ (ϕ) from a subset of structured data (the information extraction approach in the yellow panel, and direct regression in red), with an optional correction for sampling bias (blue panel). Nuisance functions are estimated on the structured subset (S = 1), enabling construction of DR pseudo-outcomes ∆x . The proposed estimators leverage these to estimate the target effect… view at source ↗
Figure 3
Figure 3. Figure 3: Performance of all methods on the MIMIC (top row) and SynSUM (bottom row) datasets, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

We study how to learn treatment policies from multimodal electronic health records (EHRs) that consist of tabular data and clinical text. These policies can help physicians make better treatment decisions and allocate healthcare resources more efficiently. Causal policy learning methods prioritize patients with the largest expected treatment benefit. Yet, existing estimators are designed for tabular covariates under causal assumptions that may be hard to justify in the multimodal setting. A pragmatic alternative is to apply causal estimators directly to multimodal representations, but this can produce biased treatment effect estimates when the representations do not preserve the relevant confounding information. As a result, predictive models of baseline risk are commonly used in practice to guide treatment decisions, although they are not designed to identify which patients benefit most from treatment. We propose AACE (Annotation-Assisted Coarsened Effects), an annotation-assisted approach to causal policy learning for multimodal EHRs. The method uses expert-provided annotations during training to support confounding adjustment, and then predicts treatment benefit from only multimodal representations at inference. We show that the proposed method achieves strong empirical performance across synthetic, semi-synthetic, and real-world EHR datasets, outperforming risk-based and representation-based causal baselines, and offering practical insights for applying causal machine learning in clinical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AACE (Annotation-Assisted Coarsened Effects), a method for learning treatment policies from multimodal EHRs (tabular + clinical text). Expert annotations are used only during training to support coarsened-effect confounding adjustment; at inference the model predicts treatment benefit from multimodal representations alone. Empirical results across synthetic, semi-synthetic, and real-world datasets are reported to show outperformance relative to risk-based and representation-based causal baselines.

Significance. If the central empirical claims hold, the work supplies a pragmatic route to causal policy learning in multimodal clinical data where annotations are costly at deployment. The multi-regime evaluation (synthetic through real EHR) is a concrete strength that allows direct comparison of policy value estimates.

major comments (2)
  1. [§3.2] §3.2 (Method): the claim that the learned multimodal encoder necessarily recovers the annotation-adjusted treatment-benefit ordering at inference is asserted without a supporting bound, sensitivity result, or identifiability argument. Because the reported policy-value gains rest on this preservation property, the absence of such analysis makes the validity of the annotation-free inference step load-bearing for the central claim.
  2. [§4.3] §4.3 (Real-world EHR experiments): the performance tables report point estimates of policy value but do not include error bars, bootstrap intervals, or sensitivity checks to annotation quality or residual confounding; without these, it is impossible to assess whether the observed gains over representation-based baselines are robust or dataset-specific.
minor comments (2)
  1. [§3] Notation for the coarsened effect estimator and the multimodal encoder should be unified across §3.1 and §3.2 to avoid ambiguity in how the annotation signal is injected during training.
  2. [Figure 2] Figure 2 (synthetic data results) would benefit from explicit labeling of the x-axis as the degree of confounding strength rather than an opaque index.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Method): the claim that the learned multimodal encoder necessarily recovers the annotation-adjusted treatment-benefit ordering at inference is asserted without a supporting bound, sensitivity result, or identifiability argument. Because the reported policy-value gains rest on this preservation property, the absence of such analysis makes the validity of the annotation-free inference step load-bearing for the central claim.

    Authors: We appreciate the referee's observation that a formal bound or identifiability argument would strengthen the presentation. The AACE training objective explicitly incorporates expert annotations to produce coarsened-effect-adjusted targets; the multimodal encoder is then optimized to predict these targets from the available modalities. This design choice is intended to align the learned representations with the annotation-adjusted ordering. While the original submission emphasizes empirical validation across synthetic, semi-synthetic, and real regimes rather than a complete theoretical guarantee, we will add a dedicated paragraph in §3.2 discussing the preservation property under the coarsened-effects framework and include a sensitivity analysis that varies annotation quality and measures resulting changes in policy value. A full identifiability proof remains an open theoretical question. revision: partial

  2. Referee: [§4.3] §4.3 (Real-world EHR experiments): the performance tables report point estimates of policy value but do not include error bars, bootstrap intervals, or sensitivity checks to annotation quality or residual confounding; without these, it is impossible to assess whether the observed gains over representation-based baselines are robust or dataset-specific.

    Authors: We agree that uncertainty quantification and sensitivity checks are necessary to evaluate robustness. In the revised manuscript we will replace the point estimates in the real-world tables of §4.3 with bootstrap confidence intervals (1,000 resamples) and add two new sensitivity panels: one varying the fraction and quality of annotations used during training, and one examining performance under increasing levels of simulated residual confounding. These additions will allow readers to assess whether the reported gains are stable across plausible annotation and confounding regimes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external annotations and empirical validation rather than self-referential definitions or fits

full rationale

The paper proposes the AACE method, which incorporates expert annotations only during training to aid confounding adjustment on multimodal EHR inputs and then drops them at inference to predict from learned representations. No load-bearing derivation step reduces by construction to its own inputs, fitted parameters, or self-citation chains; the central claim of improved policy learning is supported by explicit experiments across synthetic, semi-synthetic, and real-world datasets rather than being defined into the result. The method is self-contained against external benchmarks, with validity treated as an empirical question.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that expert annotations can be obtained and used to correct confounding without introducing new bias, plus standard causal assumptions that may be harder to justify in multimodal EHR settings. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Expert annotations during training can support valid confounding adjustment for multimodal data.
    Invoked to justify why the method works at inference without annotations.

pith-pipeline@v0.9.0 · 5745 in / 1164 out tokens · 28284 ms · 2026-05-19T02:15:11.733722+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

  1. [1]

    Bayesian inference of individualized treatment effects using multi-task gaussian processes

    Ahmed Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task gaussian processes. In Advances in Neural Information Processing Systems, 2017

  2. [2]

    From text to treatment effects: A meta-learning approach to handling text-based confounding

    Henri Arno, Paloma Rabaey, and Thomas Demeester. From text to treatment effects: A meta-learning approach to handling text-based confounding. In NeurIPS 2024 Workshop on Causal Representation Learning (CRL), 2024. 9

  3. [3]

    Proximal causal inference with text data

    Jacob Chen, Rohit Bhattacharya, and Katherine Keith. Proximal causal inference with text data. In Advances in Neural Information Processing Systems, 2024

  4. [4]

    Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms

    Alicia Curth and Mihaela van der Schaar. Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021

  5. [5]

    On inductive biases for heterogeneous treatment effect estimation

    Alicia Curth and Mihaela van der Schaar. On inductive biases for heterogeneous treatment effect estimation. In Advances in Neural Information Processing Systems, 2021

  6. [6]

    Conceptualizing treatment leakage in text-based causal inference, 2022

    Adel Daoud, Connor Jerzak, and Richard Johansson. Conceptualizing treatment leakage in text-based causal inference, 2022. preprint - arXiv:2205.00465

  7. [7]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019

  8. [8]

    How to make causal inferences using texts

    Naoki Egami, Christian Fong, Justin Grimmer, Margaret Roberts, and Brandon Stewart. How to make causal inferences using texts. Science Advances, 8(42), 2022

  9. [9]

    Language-agnostic BERT sentence embedding

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

  10. [10]

    Causal machine learning for predicting treatment outcomes

    Stefan Feuerriegel, Dennis Frauen, Valentyn Melnychuk, Jonas Schweisthal, Konstantin Hess, Alicia Curth, Stefan Bauer, Niki Kilbertus, Isaac Kohane, and Mihaela van der Schaar. Causal machine learning for predicting treatment outcomes. Nature Medicine, 30(4), 2024

  11. [11]

    SimCSE: Simple contrastive learning of sentence embed- dings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embed- dings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

  12. [12]

    Operationalizing complex causes: A pragmatic view of mediation

    Limor Gultchin, David Watson, Matt Kusner, and Ricardo Silva. Operationalizing complex causes: A pragmatic view of mediation. In Proceedings of the 38th International Conference on Machine Learning, 2021

  13. [13]

    Image-based treatment effect heterogeneity

    Connor Jerzak, Fredrik Johansson, and Adel Daoud. Image-based treatment effect heterogeneity. In Proceedings of the 2nd Conference on Causal Learning and Reasoning, 2023

  14. [14]

    Quantifying ignorance in individual-level causal-effect estimates under hidden confounding

    Andrew Jesson, Sören Mindermann, Yarin Gal, and Uri Shalit. Quantifying ignorance in individual-level causal-effect estimates under hidden confounding. In Proceedings of the 38th International Conference on Machine Learning, 2021

  15. [15]

    Generalization bounds and representation learning for estimation of potential outcomes and causal effects

    Fredrik Johansson, Uri Shalit, Nathan Kallus, and David Sontag. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. Journal of Machine Learning Research, 23(166), 2022

  16. [16]

    Learning representations for counterfactual inference

    Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, 2016

  17. [17]

    MIMIC-III, a freely accessible critical care database

    Alistair Johnson, Tom Pollard, Lu Shen, Li-wei Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3(1), 2016

  18. [18]

    Causal inference with noisy and missing covariates via matrix factorization

    Nathan Kallus, Xiaojie Mao, and Madeleine Udell. Causal inference with noisy and missing covariates via matrix factorization. Advances in Neural Information Processing Systems, 2018

  19. [19]

    Text and causal inference: A review of using text to remove confounding from causal estimates

    Katherine Keith, David Jensen, and Brendan O’Connor. Text and causal inference: A review of using text to remove confounding from causal estimates. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

  20. [20]

    Text as causal mediators: research design for causal estimates of differential treatment of social groups via language aspects

    Katherine Keith, Douglas Rice, and Brendan O’Connor. Text as causal mediators: research design for causal estimates of differential treatment of social groups via language aspects. In EMNLP 2021 Workshop on Causal Inference and NLP, 2021

  21. [21]

    Towards optimal doubly robust estimation of heterogeneous causal effects

    Edward Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2), 2023

  22. [22]

    Doublemldeep: Estimation of causal effects with multimodal data, 2024

    Sven Klaassen, Jan Teichert-Kluge, Philipp Bach, Victor Chernozhukov, Martin Spindler, and Suhas Vijaykumar. Doublemldeep: Estimation of causal effects with multimodal data, 2024. preprint - arXiv:2402.01785

  23. [23]

    Meta learners for estimating heterogeneous treatment effects using machine learning

    Sören Künzel, Jasjeet Sekhon, Peter Bickel, and Bin Yu. Meta learners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences , 116(10), 2019

  24. [24]

    Llm-driven treatment effect estimation under inference time text confounding, 2025

    Yuchen Ma, Dennis Frauen, Jonas Schweisthal, and Stefan Feuerriegel. Llm-driven treatment effect estimation under inference time text confounding, 2025. preprint - arXiv:2507.02843. 10

  25. [25]

    CausalNLP: A practical toolkit for causal inference with text, 2021

    Arun Maiya. CausalNLP: A practical toolkit for causal inference with text, 2021. preprint - arXiv:2106.08043

  26. [26]

    Bounds on representation-induced confound- ing bias for treatment effect estimation

    Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. Bounds on representation-induced confound- ing bias for treatment effect estimation. In Proceedings of the 12th International Conference on Learning Representations, 2024

  27. [27]

    Orthogonal Representation Learning for Estimating Causal Quantities

    Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, and Stefan Feuerriegel. Orthogonal representation learning for estimating causal quantities, 2025. preprint - arXiv:2502.04274

  28. [28]

    On a general class of orthogonal learners for the estimation of heterogeneous treatment effects.arXiv preprint arXiv:2303.12687,

    Pawel Morzywolek, Johan Decruyenaere, and Stijn Vansteelandt. On weighted orthogonal learners for heterogeneous treatment effects, 2024. preprint - arXiv:2303.12687v2

  29. [29]

    Matching with text data: An experimental evaluation of methods for matching documents and of measuring match quality

    Reagan Mozer, Luke Miratrix, Aaron Russell Kaufman, and Jason Anastasopoulos. Matching with text data: An experimental evaluation of methods for matching documents and of measuring match quality. Political Analysis, 28(4), 2020

  30. [30]

    Quasi-oracle estimation of heterogeneous treatment effects

    Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2), 2020

  31. [31]

    Causal effects of linguistic properties

    Reid Pryzant, Dallas Card, Dan Jurafsky, Victor Veitch, and Dhanya Sridhar. Causal effects of linguistic properties. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies, 2021

  32. [32]

    Synsum – synthetic benchmark with structured and unstructured medical records

    Paloma Rabaey, Henri Arno, Stefan Heytens, and Thomas Demeester. Synsum – synthetic benchmark with structured and unstructured medical records. In AAAI 2025 Workshop on GenAI4Health, 2024

  33. [33]

    Biolord-2023: semantic textual representations fusing large language models and clinical knowledge graph insights

    François Remy, Kris Demuynck, and Thomas Demeester. Biolord-2023: semantic textual representations fusing large language models and clinical knowledge graph insights. Journal of the American Medical Informatics Association, 31(9), 2024

  34. [34]

    Adjusting for confounding with text matching

    Margaret Roberts, Brandon Stewart, and Richard Nielsen. Adjusting for confounding with text matching. American Journal of Political Science, 64(4), 2020

  35. [35]

    Causal inference using potential outcomes

    Donald Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469), 2005

  36. [36]

    Diffusion causal models for counterfactual estimation

    Pedro Sanchez and Sotirios Tsaftaris. Diffusion causal models for counterfactual estimation. InProceedings of the 1st Conference on Causal Learning and Reasoning, 2022

  37. [37]

    Estimating individual treatment effect: generalization bounds and algorithms

    Uri Shalit, Fredrik Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, 2017

  38. [38]

    Adapting neural networks for the estimation of treatment effects

    Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects. In Advances in Neural Information Processing Systems, 2019

  39. [39]

    I see, therefore i do: Estimating causal effects for image treatments, 2024

    Abhinav Thorat, Ravi Kolla, and Niranjan Pedanekar. I see, therefore i do: Estimating causal effects for image treatments, 2024. preprint - arXiv:2412.06810

  40. [40]

    Adapting text embeddings for causal inference

    Victor Veitch, Dhanya Sridhar, and David Blei. Adapting text embeddings for causal inference. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, 2020

  41. [41]

    Estimation and inference of heterogeneous treatment effects using random forests

    Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 2018

  42. [42]

    Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

    Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, 20...

  43. [43]

    Adjusting for confounders with text: Challenges and an empirical evaluation framework for causal inference

    Galen Weld, Peter West, Maria Glenski, David Arbour, Ryan Rossi, and Tim Althoff. Adjusting for confounders with text: Challenges and an empirical evaluation framework for causal inference. In Proceedings of the 15th International AAAI Conference on Web and Social Media, 2022

  44. [44]

    Challenges of using text classifiers for causal inference

    Zach Wood-Doughty, Ilya Shpitser, and Mark Dredze. Challenges of using text classifiers for causal inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

  45. [45]

    GANITE: Estimation of individualized treatment effects using generative adversarial nets

    Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In Proceedings of the 6th International Conference on Learning Representations, 2018

  46. [46]

    Optimizing multi-scale representations to detect effect heterogeneity using earth observation and computer vision: Application to two anti-poverty rcts

    Fucheng Warren Zhu, Connor Jerzak, and Adel Daoud. Optimizing multi-scale representations to detect effect heterogeneity using earth observation and computer vision: Application to two anti-poverty rcts. In Proceedings of the 4th Conference on Causal Learning and Reasoning, 2025. 11 Appendix This appendix provides additional technical details to support t...