arxiv: 2511.10888 · v2 · submitted 2025-11-14 · 🧬 q-bio.OT

Multi-omic Enriched Blood-Derived Digital Signatures Reveal Mechanistic and Confounding Disease Clusters for Differential Diagnosis

Pith reviewed 2026-05-17 22:56 UTC · model grok-4.3

classification 🧬 q-bio.OT
keywords: digital blood twin · disease clustering · blood biomarkers · hierarchical clustering · cytokine signaling · hematological diseases · mechanistic overlaps · precision diagnostics

The pith

Blood-derived digital signatures recover clinically meaningful disease clusters and reveal shared inflammatory mechanisms across categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper constructs a digital blood twin computational model from blood analyte profiles across 103 diseases. Profiles are standardized into a disease-analyte matrix and pairwise Pearson correlations are used to build a hierarchical clustering tree that is cut into 16 groups. The largest cluster shows enrichment for cytokine-signaling pathways, indicating shared inflammatory mechanisms that cross traditional disease boundaries, while hematological conditions form a tight group and metabolic or respiratory ones are more scattered. Random Forest analysis flags neutrophils, mean corpuscular volume, red blood cell count, and platelet count as the strongest separating features. A reader would care because the work suggests that everyday lab blood tests contain enough structure to help reorganize how diseases are grouped for diagnosis and to spot overlapping biology.

Core claim

The authors construct a digital blood twin from longitudinal hematological and biochemical analytes across 103 disease signatures. They standardize these into a unified disease-analyte matrix and use pairwise Pearson correlations to measure similarity, followed by hierarchical clustering that partitions the tree into 16 groups at a stringent threshold. Enrichment analysis on the largest heterogeneous cluster points to cytokine-signaling pathways as a common mechanism. PCA and UMAP confirm the separation of hematological diseases, and Random Forest identifies neutrophils, mean corpuscular volume, red blood cell count, and platelet count as top discriminative features.

What carries the argument

The digital blood twin, a computational model based on a standardized disease-analyte matrix and pairwise Pearson correlations followed by hierarchical clustering.
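
As a rough illustration of that machinery, the following sketch computes the correlation distance 1 − ρ_ij between profiles and applies average-linkage (UPGMA) clustering, stopping at a distance threshold. It is a minimal sketch under stated assumptions, not the authors' code: the four profiles are invented, already-standardized values, the function names are hypothetical, and the threshold of 0.02 simply mirrors the cut distance quoted in the Supplementary Table 2 caption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # undefined for a constant profile

def correlation_distance(profiles):
    """Pairwise distance d_ij = 1 - rho_ij between rows (diseases)."""
    n = len(profiles)
    return [[0.0 if i == j else 1.0 - pearson(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]

def upgma_cut(dist, threshold):
    """Average-linkage agglomeration that stops once the closest pair of
    clusters is farther apart than `threshold`. Returns one label per row."""
    clusters = {i: [i] for i in range(len(dist))}

    def average_link(a, b):
        # UPGMA distance: mean of all pairwise distances between members.
        total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > 1:
        keys = sorted(clusters)
        d, a, b = min((average_link(p, q), p, q)
                      for idx, p in enumerate(keys) for q in keys[idx + 1:])
        if d > threshold:
            break
        clusters[a].extend(clusters.pop(b))

    labels = [0] * len(dist)
    for label, members in enumerate(clusters.values()):
        for m in members:
            labels[m] = label
    return labels

# Hypothetical standardized profiles (rows: diseases, columns: analytes).
profiles = [
    [0.0, -2.0, -1.5, 0.5],   # invented "hematological" profile 1
    [0.2, -1.8, -1.6, 0.4],   # invented "hematological" profile 2
    [2.0, 0.5, 0.0, -0.5],    # invented "inflammatory" profile 1
    [2.2, 0.4, 0.1, -0.3],    # invented "inflammatory" profile 2
]
labels = upgma_cut(correlation_distance(profiles), threshold=0.02)
print(labels)
```

Because Pearson correlation only compares the shape of two profiles across analytes, two diseases with similar shapes but very different magnitudes end up close together, which is one reason the choice of similarity measure matters for the resulting groups.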

If this is right

  • Hematopoietic disorders form a consistent and distinct cluster.
  • Metabolic, endocrine, and respiratory diseases display weaker internal cohesion and more heterogeneous placement.
  • The largest cluster converges on cytokine-signaling pathways that transcend conventional clinical categories.
  • Neutrophils, mean corpuscular volume, red blood cell count, and platelet count emerge as the most discriminative analytes.
  • Routine laboratory data combined with this network physiology approach can refine disease ontology and map comorbidities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same correlation-and-clustering pipeline could be applied to patient-level longitudinal blood data to test whether it improves differential diagnosis in clinical settings.
  • Extending the matrix to include additional omics layers might expose further mechanistic overlaps not visible from blood analytes alone.
  • The identified clusters could serve as a starting point for predicting co-occurrence risks between diseases that share biomarker profiles.

Load-bearing premise

That pairwise Pearson correlations computed on standardized disease-analyte profiles accurately capture true mechanistic similarities rather than being driven by confounding variables or data collection biases.

What would settle it

Re-running the clustering and enrichment steps on an independent collection of disease profiles, or with a different similarity measure such as Spearman correlation, and finding that the 16-group partition or the cytokine pathway enrichment disappears.
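
To make the Spearman alternative concrete, here is a small sketch that computes both coefficients for one pair of profiles. Spearman correlation is Pearson correlation applied to ranks, so a single extreme analyte value affects it far less. The two profiles are invented for illustration and the helper names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Invented profiles: same ordering of analytes, one extreme value in profile_a.
profile_a = [0.1, 0.5, 1.0, 1.5, 8.0]
profile_b = [0.2, 0.4, 0.9, 1.2, 1.3]

r_pearson = pearson(profile_a, profile_b)
r_spearman = spearman(profile_a, profile_b)
print(round(r_pearson, 3), round(r_spearman, 3))
```

If the two measures disagree this much on many disease pairs, the distance matrix, and therefore the tree, could change substantially, which is what the proposed check would reveal.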

Figures

Figures reproduced from arXiv: 2511.10888 by Abicumaran Uthamacumaran, Alexander Fulton, Bolin Liu, Hector Zenil.

Figure 1: Complete phylogenetic tree of 103 disease profiles constructed using Pearson correlation distance (1 − ρ_ij) and UPGMA hierarchical clustering. The y-axis represents clustering distance.
Figure 2: Phylogenetic tree of 103 disease profiles.
Figure 4: A) Reactome enrichment dotplot of Cluster 9 diseases. The top pathways include Signaling by interleukins, IL-4/IL-13 signaling, Extracellular matrix organization, Platelet activation/degranulation, and IL-10 signaling. B) Reactome cnetplot of the top 8 pathways, highlighting shared genes linking cytokine signaling, platelet function, and extracellular matrix remodeling. C) Reactome emapplot (cutoff = 0.35)…
Figure 6: Random Forest analysis of feature importance.
Original abstract

Understanding disease relationships through blood biomarkers offers a pathway toward data-driven taxonomy and precision medicine. In this study, we constructed a digital blood twin, a computational model derived from 103 disease signatures comprising longitudinal hematological and biochemical analytes. Profiles were standardized into a unified disease-analyte matrix, and pairwise Pearson correlations were computed to assess similarity across conditions. Hierarchical clustering revealed consistent grouping of hematopoietic disorders, while metabolic, endocrine, and respiratory diseases were more heterogeneous, reflecting weaker internal cohesion. To evaluate cluster structure, the tree was partitioned at a stringent distance threshold, yielding 16 groups. Enrichment analysis of the largest and most heterogeneous cluster demonstrated convergence on cytokine-signaling pathways, indicating shared inflammatory mechanisms that transcend conventional clinical boundaries. PCA and UMAP corroborated the correlation-based results, consistently separating hematological diseases as a distinct cluster. Random Forest feature selection identified neutrophils, mean corpuscular volume, red blood cell count, and platelet count as the most discriminative analytes, reinforcing the role of hematopoietic markers as key drivers of disease stratification. Collectively, these findings show that blood-derived digital signatures can recover clinically meaningful disease clusters while uncovering mechanistic overlaps across categories. This network physiology framework highlights the potential of integrating routine laboratory data with computational methods to refine disease ontology, map comorbidities, and advance precision diagnostics.
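
The abstract does not say how the enrichment analysis was computed. A common choice for pathway over-representation is a one-sided hypergeometric test, and the sketch below assumes that choice purely for illustration. All counts are invented, and how disease clusters are mapped to gene sets is not specified in this extract.

```python
from math import comb

def hypergeom_pvalue(overlap, sample_size, pathway_size, background_size):
    """P(X >= overlap) when `sample_size` genes are drawn without replacement
    from `background_size` genes, `pathway_size` of which are in the pathway."""
    upper = min(sample_size, pathway_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, sample_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background_size, sample_size)

# Invented example: 40 cluster-associated genes, 10 of which fall in a
# 50-gene pathway, against a background of 1000 genes (expected overlap: 2).
p_enriched = hypergeom_pvalue(overlap=10, sample_size=40,
                              pathway_size=50, background_size=1000)
p_typical = hypergeom_pvalue(overlap=3, sample_size=40,
                             pathway_size=50, background_size=1000)
print(p_enriched, p_typical)
```

A test of this kind is run once per pathway, so the resulting p-values need a multiple-testing correction before any pathway is called enriched.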

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs a 'digital blood twin' computational model from 103 disease signatures using longitudinal hematological and biochemical analytes. Disease profiles are standardized into a unified matrix, pairwise Pearson correlations are computed to measure similarity, and hierarchical clustering with a fixed distance threshold partitions the data into 16 groups. Enrichment analysis on the largest heterogeneous cluster identifies convergence on cytokine-signaling pathways. PCA, UMAP, and Random Forest feature selection corroborate separation of hematological diseases and highlight neutrophils, mean corpuscular volume, red blood cell count, and platelet count as key discriminative analytes. The central claim is that blood-derived signatures recover clinically meaningful clusters and uncover mechanistic overlaps transcending conventional disease categories.

Significance. If the correlations and clusters reflect true mechanistic similarities rather than artifacts, the work could provide a data-driven network physiology framework for refining disease ontology, mapping comorbidities, and advancing precision diagnostics using routine laboratory data. It would demonstrate the utility of unsupervised methods on standardized analyte profiles for identifying shared inflammatory mechanisms across metabolic, endocrine, respiratory, and other categories.

major comments (2)
  1. [Abstract (pipeline description) and Results (clustering and enrichment)] The pipeline (standardization, Pearson correlations, hierarchical clustering at a fixed distance threshold to yield 16 groups, followed by enrichment) provides no evidence of covariate adjustment, matching, or sensitivity analyses for potential confounders such as age, sex, BMI, medications, or batch effects. This directly undermines the claim that observed clusters and cytokine-signaling enrichment represent mechanistic overlaps rather than data collection biases or unmodeled variables.
  2. [Abstract and Methods] No sample sizes, data sources, statistical thresholds for enrichment, multiple-testing corrections, or error estimates are reported for the 103 signatures or the 16 groups. Without these, it is impossible to evaluate the robustness of the hematological separation or the cross-category convergence.
minor comments (2)
  1. [Abstract] The term 'digital blood twin' is introduced without a precise mathematical definition or comparison to existing digital twin concepts in the literature.
  2. [Results (hierarchical clustering)] Clarify whether the distance threshold for tree partitioning was chosen a priori or post hoc, and report sensitivity of the 16-group structure to small changes in this threshold.
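
The threshold-sensitivity check requested in the last minor comment can be sketched as follows. Average-linkage merge heights never decrease, so the number of clusters at a threshold is the number of profiles minus the number of merges at or below that threshold. The distance matrix is invented and the function names are hypothetical; this is a sketch of the idea, not the paper's procedure.

```python
def upgma_merge_heights(dist):
    """Run average-linkage (UPGMA) clustering on a full distance matrix and
    return the height of every merge, in the order the merges happen."""
    clusters = {i: [i] for i in range(len(dist))}
    heights = []

    def average_link(a, b):
        total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > 1:
        keys = sorted(clusters)
        h, a, b = min((average_link(p, q), p, q)
                      for idx, p in enumerate(keys) for q in keys[idx + 1:])
        heights.append(h)
        clusters[a].extend(clusters.pop(b))
    return heights

def n_clusters_at(heights, threshold):
    """Clusters left when the tree is cut at `threshold`."""
    return len(heights) + 1 - sum(h <= threshold for h in heights)

# Invented correlation distances between five profiles.
dist = [
    [0.00, 0.01, 0.03, 0.50, 0.52],
    [0.01, 0.00, 0.03, 0.48, 0.50],
    [0.03, 0.03, 0.00, 0.45, 0.47],
    [0.50, 0.48, 0.45, 0.00, 0.018],
    [0.52, 0.50, 0.47, 0.018, 0.00],
]
heights = upgma_merge_heights(dist)
thresholds = (0.015, 0.02, 0.025, 0.035)
counts = [n_clusters_at(heights, t) for t in thresholds]
for t, c in zip(thresholds, counts):
    print(f"threshold {t}: {c} clusters")
```

Reporting a table of this kind around the chosen cut would show whether the 16-group structure is stable or an artifact of one particular threshold.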

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate revisions to be incorporated in the next version of the manuscript.

  1. Referee: The pipeline (standardization, Pearson correlations, hierarchical clustering at a fixed distance threshold to yield 16 groups, followed by enrichment) provides no evidence of covariate adjustment, matching, or sensitivity analyses for potential confounders such as age, sex, BMI, medications, or batch effects. This directly undermines the claim that observed clusters and cytokine-signaling enrichment represent mechanistic overlaps rather than data collection biases or unmodeled variables.

    Authors: We agree that explicit treatment of potential confounders is necessary to support mechanistic interpretations. The 103 signatures were compiled from published aggregate data and public repositories rather than raw individual-level records, which limits direct covariate adjustment. In the revised manuscript we will add a new subsection in Methods describing the provenance of each signature and any available metadata on demographics or batch information. We will also report sensitivity analyses (e.g., re-clustering after excluding signatures with known age or sex imbalance where metadata permit) and expand the Discussion to quantify how unmodeled variables could affect cluster stability and enrichment results. These additions will make the limitations transparent while preserving the core finding that blood-analyte patterns recover reproducible groupings. revision: yes

  2. Referee: No sample sizes, data sources, statistical thresholds for enrichment, multiple-testing corrections, or error estimates are reported for the 103 signatures or the 16 groups. Without these, it is impossible to evaluate the robustness of the hematological separation or the cross-category convergence.

    Authors: We acknowledge that these quantitative details were omitted from the initial submission. The revised Methods section will now list, for each of the 103 signatures, the source study or database, the number of independent samples or patients contributing to the signature, and the number of analytes measured. For the enrichment analysis we will specify the exact statistical test, the FDR threshold employed (e.g., Benjamini-Hochberg FDR < 0.05), and any multiple-testing correction applied across the 16 clusters. Cluster stability will be quantified by reporting bootstrap or permutation-based error estimates on the cophenetic distances and on the Random Forest feature importances. These additions will allow readers to assess the statistical support for the reported separations and convergences. revision: yes
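
For reference, the Benjamini-Hochberg adjustment named in the second response can be written in a few lines. This is a generic sketch of the standard step-up procedure with invented p-values. It does not reproduce the paper's enrichment results, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented pathway p-values.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.60]
qvalues = benjamini_hochberg(pvalues)
significant = [q < 0.05 for q in qvalues]
print(qvalues)
print(significant)
```

In this invented example the third and fourth p-values are below 0.05 before adjustment but not after it, which is the kind of difference the referee's request is meant to expose.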

Circularity Check

0 steps flagged

No significant circularity; standard data-analysis pipeline

The paper constructs a disease-analyte matrix from 103 signatures, standardizes it, computes pairwise Pearson correlations, applies hierarchical clustering (partitioned at a fixed distance threshold to yield 16 groups), runs enrichment, PCA, UMAP, and Random Forest feature selection. All steps are direct, off-the-shelf applications of established algorithms to the input data; no equations reduce outputs to fitted parameters by construction, no self-definitional loops, and no load-bearing self-citations or uniqueness theorems are invoked. The central claims are empirical results of this pipeline rather than tautological restatements of the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that correlation-based similarity in blood analyte profiles reflects underlying biology, plus the post-hoc choice of a stringent distance threshold to define 16 clusters and the interpretation of enrichment as mechanistic convergence.

free parameters (1)
  • distance threshold for tree partitioning
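    Used to yield exactly 16 groups; described as stringent, with no selection criterion provided. The only numerical value in this extract is the Supplementary Table 2 caption, which reports a cut at distance = 0.02.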
axioms (1)
  • domain assumption: Pearson correlation on standardized analyte profiles measures biologically meaningful disease similarity
    Invoked when constructing the similarity matrix and interpreting clusters as mechanistic overlaps.
invented entities (1)
  • digital blood twin (no independent evidence)
    purpose: Computational model derived from longitudinal hematological and biochemical disease signatures
    New term introduced for the unified disease-analyte matrix and derived signatures; no independent falsifiable prediction or external validation supplied.

pith-pipeline@v0.9.0 · 5543 in / 1088 out tokens · 47039 ms · 2026-05-17T22:56:03.499895+00:00 · methodology


Supplementary Material

Table 2: Disease profiles grouped by hierarchical clustering (cut at distance = 0.02). Profile IDs are omitted for clarity; only disease na...