Computational Phenotype Discovery via Probabilistic Independence

Diego A. Mesa; Thomas A. Lasko

arxiv: 1907.11051 · v1 · pith:6M4XJDO3new · submitted 2019-07-25 · 📊 stat.AP


Pith reviewed 2026-05-24 15:49 UTC · model grok-4.3

classification 📊 stat.AP

keywords phenotype discovery, probabilistic independence, electronic health records, hepatocellular carcinoma, longitudinal curves, disentangling phenotypes, EHR transformation

The pith

Probabilistic independence disentangles EHR phenotypes into patterns that may match true pathophysiologic mechanisms after transformation to continuous curves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that probabilistic independence can serve as a guiding principle to separate phenotypes from electronic health records so the resulting patterns align more closely with underlying disease mechanisms. Sparse and irregular observations are first converted into continuous longitudinal curves to enable this separation. A sympathetic reader would care because current phenotype discovery methods are often pragmatic and may not reflect biology, whereas this approach offers a more principled route with direct relevance to predicting outcomes such as hepatocellular carcinoma from liver disease patterns.

Core claim

Transformation of sparse irregular EHR observations into continuous longitudinal curves, followed by application of probabilistic independence, allows disentangling of phenotypes into patterns that may more closely match true pathophysiologic mechanisms, demonstrated by identifying liver disease patterns that presage development of Hepatocellular Carcinoma.

What carries the argument

Probabilistic independence as a guiding principle for disentangling phenotypes from continuous longitudinal curves derived from sparse EHR observations.
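
The abstract does not say how independence is operationalized. As a minimal sketch, assuming linear independent component analysis (ICA) as a stand-in, the pure-NumPy FastICA below unmixes two synthetic signals. The sources, the mixing matrix, and the names `whiten`, `sym_decorrelate`, and `fast_ica` are illustrative assumptions, not the authors' method.

```python
import numpy as np

def whiten(X):
    """Center X (n_samples, n_features) and rescale it to identity covariance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ (eigvecs / np.sqrt(eigvals))

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W

def fast_ica(X, n_iter=500, tol=1e-8, seed=0):
    """Symmetric FastICA with a tanh nonlinearity; returns estimated sources."""
    Z = whiten(X)
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)
        # Fixed-point update per row w: E[z g(w.z)] - E[g'(w.z)] w
        W_new = (G.T @ Z) / n - (1.0 - G**2).mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when each new row is parallel to the old one (up to sign)
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return Z @ W.T

# Synthetic stand-ins for two independent, non-Gaussian "phenotype" signals
rng = np.random.default_rng(42)
n = 5000
sources = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(0, 1, n)])
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])  # hypothetical mixing matrix
observed = sources @ mixing.T

recovered = fast_ica(observed)
# Absolute correlation between each true source and each recovered component
corr = np.abs(np.corrcoef(sources.T, recovered.T)[:2, 2:])
best_match = corr.max(axis=1)
```

ICA returns components in arbitrary order and with arbitrary sign, which is why `best_match` compares by absolute correlation rather than position.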

If this is right

Phenotypes separated this way will align more closely with pathophysiologic mechanisms than those from pragmatic approaches.
The method can identify patterns that presage specific diseases such as Hepatocellular Carcinoma.
Curve transformation makes irregular sparse data amenable to independence-based separation without losing essential longitudinal structure.
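
To make the curve transformation concrete, here is a minimal sketch that assumes a Gaussian-kernel (Nadaraya-Watson) smoother. The abstract does not name the curve model, so the smoother, the function `to_curve`, the bandwidth, and the toy measurements are all assumptions for illustration.

```python
import numpy as np

def to_curve(obs_times, obs_values, grid, bandwidth):
    """Nadaraya-Watson estimate of a measurement on a regular time grid.

    Each grid point is a weighted average of all observations, with Gaussian
    weights that decay with time distance. `bandwidth` uses the same units
    as the times.
    """
    obs_times = np.asarray(obs_times, dtype=float)
    obs_values = np.asarray(obs_values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    diffs = (grid[:, None] - obs_times[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)
    return (weights @ obs_values) / weights.sum(axis=1)

# Hypothetical patient: seven lab values taken on irregular days
days = [0, 3, 4, 30, 95, 100, 180]
values = [40, 42, 45, 60, 85, 90, 120]

grid = np.arange(0, 181, 10)          # one point every 10 days
curve = to_curve(days, values, grid, bandwidth=20.0)
```

Because every grid value is a weighted average of the observations, the curve cannot leave the observed range. A small bandwidth tracks individual measurements and a large one flattens them, so the bandwidth is exactly the kind of free choice the load-bearing premise depends on.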

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be tested on other episodic observation domains to check whether independence consistently recovers mechanism-level signals.
If the phenotypes prove stable across different curve-fitting choices, the approach would gain robustness for clinical deployment.
Linking the discovered patterns directly to genomic or proteomic data would provide an independent test of whether they reflect true mechanisms.
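
One way the stability check could be scored is sketched below, assuming each run of the pipeline yields a matrix with one component per row. The function `component_stability` and the greedy matching rule are illustrative choices, not something the paper specifies.

```python
import numpy as np

def component_stability(A, B):
    """Mean best-match |correlation| between two sets of components.

    A and B have shape (n_components, n_features), one component per row.
    Matching is greedy and ignores order and sign, both of which are
    arbitrary in ICA-style decompositions.
    """
    k = A.shape[0]
    corr = np.abs(np.corrcoef(A, B)[:k, k:])
    scores = []
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        scores.append(corr[i, j])
        corr[i, :] = -1.0   # remove the matched row from further matching
        corr[:, j] = -1.0   # remove the matched column from further matching
    return float(np.mean(scores))

rng = np.random.default_rng(0)
run_a = rng.standard_normal((3, 50))
run_b = -run_a[[2, 0, 1]]              # same components, permuted and sign-flipped
run_c = rng.standard_normal((3, 50))   # unrelated components

stable = component_stability(run_a, run_b)
unrelated = component_stability(run_a, run_c)
```

A score near 1 across curve-fitting choices would support robustness. A score near the level seen for unrelated random components would suggest the phenotypes are driven by the fitting choice.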

Load-bearing premise

Converting sparse EHR observations into continuous longitudinal curves preserves the information needed for independence-based separation to reflect pathophysiologic mechanisms rather than artifacts of the transformation.
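
The risk in this premise can be illustrated with a small sketch. It assumes a Gaussian-kernel smoother on regularly sampled white noise, a deliberate simplification of real EHR data, and shows how smoothing alone creates strong temporal structure. The function names and bandwidth are hypothetical.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a series."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def gaussian_smooth(x, bandwidth):
    """Smooth a regularly sampled series with a normalized Gaussian kernel."""
    t = np.arange(len(x))
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
    return (w @ x) / w.sum(axis=1)

rng = np.random.default_rng(1)
noise_a = rng.standard_normal(400)   # independent noise, no real signal
noise_b = rng.standard_normal(400)

raw_ac = lag1_autocorr(noise_a)
smooth_ac = lag1_autocorr(gaussian_smooth(noise_a, bandwidth=10.0))

# Sample cross-correlation between two series that are independent by
# construction. After smoothing, fewer effectively independent points remain,
# so this estimate is noisier and can stray further from zero.
raw_cross = abs(np.corrcoef(noise_a, noise_b)[0, 1])
smooth_cross = abs(np.corrcoef(gaussian_smooth(noise_a, 10.0),
                               gaussian_smooth(noise_b, 10.0))[0, 1])
```

The raw noise has lag-1 autocorrelation near zero, while the smoothed series is almost perfectly autocorrelated. Any independence criterion applied to time slices of such curves sees far fewer independent samples than the grid size suggests.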

What would settle it

Finding that the independence-derived phenotypes correspond to known transformation artifacts or fail to predict Hepatocellular Carcinoma incidence better than pragmatic baselines in held-out EHR data would falsify the claim.

Figures

Figures reproduced from arXiv: 1907.11051 by Diego A. Mesa, Thomas A. Lasko.

**Figure 2.** Data-driven phenotypes include surprisingly detailed distinctions between early and late disease (a through e), and [...]

Original abstract

Computational Phenotype Discovery research has taken various pragmatic approaches to disentangling phenotypes from the episodic observations in Electronic Health Records. In this work, we use transformation into continuous, longitudinal curves to abstract away the sparse irregularity of the data, and we introduce probabilistic independence as a guiding principle for disentangling phenotypes into patterns that may more closely match true pathophysiologic mechanisms. We use the identification of liver disease patterns that presage development of Hepatocellular Carcinoma as a proof-of-concept demonstration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pairs curve smoothing of EHR data with probabilistic independence to discover phenotypes, but the abstract supplies no methods or validation details so the claim stays untested.

The core idea is to abstract sparse EHR observations into continuous curves and then apply probabilistic independence as the criterion for separating phenotypes, with the hope that the resulting patterns track real pathophysiologic mechanisms better than pragmatic clustering. They demonstrate it on patterns that precede hepatocellular carcinoma in liver disease patients. That specific pairing is not just a restatement of earlier independence work; the longitudinal curve step is a deliberate choice to handle irregularity before the independence criterion is applied. The proof-of-concept framing is straightforward and the motivation is clear.

The main limitation is that the abstract contains no equations, no description of how independence is operationalized or measured, and no validation or error analysis. Without those, it is impossible to judge whether the separation reflects biology or properties introduced by the smoothing and interpolation. The stress-test point about transformation artifacts is therefore live until the methods section shows otherwise.

The paper is aimed at researchers who build phenotype algorithms from real-world clinical data. A reader already working on disentangling factors in longitudinal records could extract a usable idea if the full methods deliver the missing technical steps. It deserves peer review because the framing is coherent and the application area matters, even though the current evidence level is low.

Referee Report

2 major / 1 minor

Summary. The paper claims that transforming sparse and irregular Electronic Health Record observations into continuous longitudinal curves, followed by the application of probabilistic independence as a guiding principle, enables the disentangling of phenotypes into patterns that may more closely match true pathophysiologic mechanisms, demonstrated via a proof-of-concept on liver disease patterns that presage development of Hepatocellular Carcinoma.

Significance. If the result holds after addressing the transformation step, the work would provide a principled alternative to pragmatic phenotype discovery methods in computational medicine by leveraging probabilistic independence post-transformation, with potential for improved alignment with underlying biology in EHR-based studies.

major comments (2)

[Methods (data transformation)] The central claim requires that the mapping from sparse EHR observations to continuous longitudinal curves preserves information such that independence-based separation reflects pathophysiologic mechanisms. No analysis, sensitivity test, or theoretical argument is provided to establish that the smoothing/interpolation does not impose correlation structures that the independence criterion then exploits artifactually rather than recovering biological signals. This is load-bearing for the claim.
[Results] The proof-of-concept demonstration on liver disease patterns does not include quantitative metrics, error analysis, or comparison against baseline phenotype discovery approaches to substantiate that the resulting patterns align with mechanisms beyond what the curve construction itself produces.

minor comments (1)

[Abstract] The abstract contains no equations, implementation details, or validation metrics, which limits the ability to evaluate the technical soundness from the outset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments highlight important aspects that will improve the clarity and rigor of our work. We respond to each major comment below, indicating the revisions we intend to make.

Referee: [Methods (data transformation)] The central claim requires that the mapping from sparse EHR observations to continuous longitudinal curves preserves information such that independence-based separation reflects pathophysiologic mechanisms. No analysis, sensitivity test, or theoretical argument is provided to establish that the smoothing/interpolation does not impose correlation structures that the independence criterion then exploits artifactually rather than recovering biological signals. This is load-bearing for the claim.

Authors: We agree that this is a load-bearing assumption for the central claim. The submitted manuscript does not contain sensitivity tests or a theoretical argument addressing whether the chosen smoothing or interpolation step could artifactually induce correlations exploited by the independence criterion. In the revised manuscript we will add a dedicated subsection in Methods that reports sensitivity analyses across a range of interpolation parameters (e.g., spline order, Gaussian-process length-scale) and quantifies stability of the recovered independent components. We will also include a short theoretical discussion, based on properties of functional data representations, of the conditions under which the transformation is expected to preserve rather than create independence structure.

Revision: yes

Referee: [Results] The proof-of-concept demonstration on liver disease patterns does not include quantitative metrics, error analysis, or comparison against baseline phenotype discovery approaches to substantiate that the resulting patterns align with mechanisms beyond what the curve construction itself produces.

Authors: We acknowledge that the current proof-of-concept is primarily qualitative and lacks the requested quantitative support. The manuscript does not report error analyses, stability metrics, or head-to-head comparisons with standard phenotype-discovery baselines. In revision we will augment the Results section with (i) quantitative phenotype-stability measures across data subsamples, (ii) predictive performance of the discovered phenotypes for subsequent HCC onset (e.g., via time-to-event models), and (iii) explicit comparisons against baseline approaches such as PCA or ICA applied directly to the constructed curves and to conventional aggregated-feature clustering. These additions will allow readers to evaluate whether the independence-guided phenotypes provide information beyond the curve-construction step alone.

Revision: yes
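
A head-to-head predictive comparison of the kind promised here could be scored as sketched below. The rebuttal mentions time-to-event models; this sketch swaps in a plain rank-based AUC on a binary outcome as a simpler stand-in. The function `auc`, the labels, and both score vectors are toy values chosen only to exercise the code and say nothing about the paper's results.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as one half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical held-out patients: 1 = later developed HCC
developed_hcc = [0, 0, 0, 1, 0, 1, 1, 0]
phenotype_score = [0.1, 0.3, 0.2, 0.8, 0.4, 0.9, 0.7, 0.35]  # toy values
baseline_score = [0.2, 0.6, 0.1, 0.5, 0.7, 0.9, 0.4, 0.3]    # toy values

auc_phenotype = auc(phenotype_score, developed_hcc)
auc_baseline = auc(baseline_score, developed_hcc)
```

The same function applied to both score sets on identical held-out patients gives the like-for-like comparison the referee asks for. A real analysis would also need confidence intervals and a handling of censoring that this sketch omits.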

Circularity Check

0 steps flagged

No significant circularity; independence applied as external principle

The provided abstract and context present probabilistic independence as an introduced guiding principle applied after an explicit data transformation step, without any quoted equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claim to its own inputs by construction. No load-bearing steps match the enumerated circularity patterns; the derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that probabilistic independence in the transformed space aligns with pathophysiologic mechanisms; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: probabilistic independence in the transformed curve space corresponds to true pathophysiologic mechanisms.
Presented as the guiding principle for disentangling phenotypes in the abstract.
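
Stated formally, and assuming the separation follows the standard linear ICA model (which the abstract does not confirm), the axiom would read roughly as follows. The symbols are introduced here for illustration only: x is the vector of curve values for one patient at one time point, s holds the k latent phenotype expressions, and A is an unknown mixing matrix.

```latex
\[
  \mathbf{x} = A\,\mathbf{s},
  \qquad
  p(\mathbf{s}) = \prod_{i=1}^{k} p_i(s_i)
\]
```

The factorized density on the right is the independence assumption itself. The claim that each recovered s_i corresponds to a distinct pathophysiologic mechanism is the additional, untested step.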



