
arxiv: 2502.05872 · v4 · submitted 2025-02-09 · 🧬 q-bio.PE

Flexible inference of evolutionary accumulation dynamics using uncertain observational data

Pith reviewed 2026-05-23 03:35 UTC · model grok-4.3

classification 🧬 q-bio.PE
keywords evolutionary pathways · hypercubic inference · uncertain data · accumulation dynamics · pathway inference · multidrug resistance · tuberculosis · observational data

The pith

HyperLAU infers evolutionary accumulation pathways from data even when up to half the features are uncertain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HyperLAU, an algorithm that extends hypercubic inference to datasets containing uncertainties, allowing the use of cross-sectional, phylogenetic, and longitudinal observations without discarding uncertain entries. It demonstrates that the method recovers the main pathways identified by prior tools when as many as 50 percent of input features are uncertain. A tuberculosis multidrug resistance example shows that the approach yields additional pathway information while avoiding biases introduced by simply excluding uncertain data. Sympathetic readers care because many real-world evolutionary datasets are sparse or noisy, and methods that tolerate uncertainty can draw on larger portions of available evidence.

Core claim

HyperLAU is a new algorithm for hypercubic inference that learns dynamic pathways and feature interactions from data that includes uncertainties, even when large sets of particular features remain unobserved across the source dataset. It is shown to highlight the main pathways recovered by other tools when up to 50 percent of the features in the input data are uncertain and to reduce biases that arise when uncertain portions of the data are simply excluded.

What carries the argument

HyperLAU algorithm, which extends hypercubic inference models to incorporate uncertain observational data for pathway learning.
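
One way to picture how an uncertain observation can still inform inference on a hypercube is to treat it as the set of fully specified states it is compatible with. The sketch below is illustrative only: the `?` marker for uncertain entries, the function names, and the example distribution are assumptions made here, and summing probability over compatible states is a generic marginalisation, not a confirmed description of HyperLAU's internals.

```python
from itertools import product


def compatible_states(barcode: str) -> list[str]:
    """Expand a barcode over {'0', '1', '?'} into every fully specified state it could represent.

    '?' is a hypothetical marker for an uncertain feature (an assumption of this sketch).
    """
    options = [("0", "1") if c == "?" else (c,) for c in barcode]
    return ["".join(bits) for bits in product(*options)]


def observation_probability(barcode: str, state_probs: dict[str, float]) -> float:
    """Sum the probability mass of all states compatible with a possibly uncertain barcode."""
    return sum(state_probs.get(s, 0.0) for s in compatible_states(barcode))


# Hypothetical distribution over states of a 3-feature hypercube.
state_probs = {"000": 0.125, "100": 0.375, "110": 0.25, "111": 0.25}

print(compatible_states("1?0"))                     # ['100', '110']
print(observation_probability("1?0", state_probs))  # 0.625
```

An observation with k uncertain features expands to 2^k candidate states, so this brute-force expansion is only practical for small k.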

If this is right

  • Datasets with up to 50 percent uncertain features can still be used to infer main evolutionary pathways.
  • Biases introduced by excluding uncertain data entries can be reduced or avoided.
  • Additional information on evolutionary pathways becomes available in applications such as multidrug resistance analysis.
  • Cross-sectional, phylogenetic, and longitudinal data sources can be combined even when individual features carry uncertainty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same uncertainty-handling logic could be adapted to other accumulation models outside the hypercubic setting.
  • Medical datasets with noisy or incomplete observations may benefit from similar flexible inference without forced data pruning.
  • Testing on synthetic datasets with controlled uncertainty levels would provide direct checks on pathway recovery rates.
  • The approach may allow retrospective re-analysis of existing evolutionary studies that previously discarded uncertain observations.
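
The synthetic test suggested in the list above could start from a complete dataset and mask a controlled fraction of entries. The following is a minimal sketch under stated assumptions: entries are masked uniformly at random, `?` marks an uncertain entry, and `mask_entries` is a hypothetical helper. The paper may introduce uncertainty differently, for example per feature rather than per entry.

```python
import random


def mask_entries(barcodes: list[str], fraction: float, seed: int = 0) -> list[str]:
    """Replace a fixed fraction of all entries with '?', chosen uniformly at random.

    Uniform masking and the '?' marker are assumptions of this sketch.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible experiments
    cells = [(i, j) for i, row in enumerate(barcodes) for j in range(len(row))]
    n_mask = round(fraction * len(cells))
    masked = [list(row) for row in barcodes]
    for i, j in rng.sample(cells, n_mask):
        masked[i][j] = "?"
    return ["".join(row) for row in masked]


# Toy complete dataset following a single pathway 000 -> 100 -> 110 -> 111.
data = ["000", "100", "110", "111"]
noisy = mask_entries(data, fraction=0.5)
print(noisy)
```

Sweeping `fraction` from 0 to 0.5 while keeping the generating pathway fixed would give the controlled comparison the bullet describes.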

Load-bearing premise

The hypercubic model structure continues to represent the underlying evolutionary process accurately when large fractions of the input features are uncertain or unobserved.

What would settle it

Application of HyperLAU to a dataset with known accumulation pathways where the recovered pathways diverge from those found by other tools once 40-50 percent of features are marked uncertain.
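
To make "diverge" measurable, one option is to compare the sets of high-flux edges in two inferred transition networks. The sketch below uses a Jaccard overlap and is not taken from the paper: the function names and example fluxes are hypothetical, and the 0.05 cutoff is borrowed from the figure captions' minimum flux for coloured edges, so its use as a "main pathway" threshold is an assumption.

```python
Edge = tuple[str, str]


def main_edges(fluxes: dict[Edge, float], threshold: float = 0.05) -> set[Edge]:
    """Keep edges whose probability flux meets the threshold (assumed cutoff)."""
    return {edge for edge, flux in fluxes.items() if flux >= threshold}


def jaccard_overlap(a: dict[Edge, float], b: dict[Edge, float], threshold: float = 0.05) -> float:
    """Jaccard index of the two high-flux edge sets; 1.0 means identical main edges."""
    ea, eb = main_edges(a, threshold), main_edges(b, threshold)
    union = ea | eb
    return len(ea & eb) / len(union) if union else 1.0


# Hypothetical fluxes from a reference tool and from a run on masked data.
reference = {("000", "100"): 0.9, ("000", "010"): 0.1, ("100", "110"): 0.9}
masked_run = {("000", "100"): 0.8, ("000", "001"): 0.2, ("100", "110"): 0.8, ("000", "010"): 0.01}

print(jaccard_overlap(reference, masked_run))  # 0.5
```

A threshold-based set comparison ignores how much flux each shared edge carries; a flux-weighted distance would be a natural refinement.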

Figures

Figures reproduced from arXiv: 2502.05872 by Iain G. Johnston, Jessica Renz, Morten Brun.

Figure 1: HyperLAU workflow. Learning evolutionary trajectories on a hypercube, based on data that contains uncertainties. (A) Dataset (structure can be cross-sectional or longitudinal) that contains information about the presence (red/dark) or absence (green/gradient) of certain features. White boxes indicate missing/uncertain information. (B) Translation of the data into binary barcodes, 1 = presence of the featur…
Figure 2: Visualization of evolutionary pathways inferred by HyperLAU based on some toy examples. Plots show inferred transition networks through the evolutionary state space from 000... (top) to 111... (bottom) (as in Figure 1D). In all plots, the thickness of the edges represents the probability flux between the corresponding state nodes (all coloured edges have minimum 0.05). Coefficient of variation (CV) is illu…
Figure 3: Visualisation of evolutionary pathways learned by HyperLAU based on the tuberculosis dataset [Casali et al., 2014]. Plots show inferred transition networks through the evolutionary state space from 000... (top) to 111... (bottom) (as in Figure 1D). In all plots, the thickness of the edges represents the probability flux between the corresponding state nodes (all coloured edges have minimum 0.05). Coeffici…
Figure 4: Inference of anti-microbial evolution in tuberculosis based on a full dataset including uncertainties. (A) Visualisation of the used dataset from [Casali et al., 2014] embedded in a phylogeny. Each row in the matrix corresponds to a bacterial isolate that is a tip in the phylogeny. Each column in the matrix describes resistance to a different drug: red fields in the profile represent missing data, white f…
Original abstract

Understanding and predicting evolutionary accumulation pathways is a key objective in many fields of research, ranging from classical evolutionary biology to diverse applications in medicine. In this context, we are often confronted with the problem that data is sparse and uncertain. To make the best possible use of the available data, inference approaches that can handle this uncertainty are required. One way to use not only cross-sectional data, but also phylogenetically related and longitudinal data, is to use 'hypercubic inference' models. In this article we introduce HyperLAU, a new algorithm for hypercubic inference that makes it possible to use datasets including uncertainties for learning evolutionary pathways. Expanding the flexibility of accumulation modelling, HyperLAU allows us to infer dynamic pathways and interactions between features, even when large sets of particular features are unobserved across the source dataset. We show that HyperLAU is able to highlight the main pathways found by other tools, even when up to 50% of the features in the input data are uncertain. Additionally, we demonstrate how it can help to overcome possible biases that can occur when the data used is reduced by excluding uncertain parts. We illustrate the approach with a case study on multidrug resistance in tuberculosis, showing that HyperLAU allows more flexible data and provides new information about evolutionary pathways compared to existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces HyperLAU, a new algorithm extending hypercubic inference models to accommodate uncertain or missing observational data (cross-sectional, phylogenetic, or longitudinal) when inferring evolutionary accumulation pathways and feature interactions. It claims that HyperLAU recovers the dominant pathways identified by existing tools even when up to 50% of input features are uncertain, mitigates bias from simply discarding uncertain observations, and yields additional pathway information on a multidrug-resistance tuberculosis dataset.

Significance. If the uncertainty-handling mechanism proves robust, the work could meaningfully expand the usable data volume for accumulation modeling in evolutionary biology and clinical microbiology. The approach directly targets a common practical limitation (sparse/uncertain features) rather than assuming complete observations. However, the significance is currently limited by the absence of controlled validation or analytic guarantees beyond a single real-world case study.

major comments (3)
  1. [Abstract] Abstract and introduction: the central empirical claim (recovery of main pathways with up to 50% uncertain features on the TB dataset) rests on a single case study without reported controlled simulations that systematically vary the fraction, correlation structure, or type of uncertainty while holding the true accumulation graph fixed. This leaves open whether the hypercubic model structure plus the chosen uncertainty encoding introduces systematic bias under the stated conditions.
  2. [Abstract] Abstract: no analytic bound, parameter-free derivation, or error-propagation analysis is referenced for how uncertainty in individual features propagates through the hypercubic inference procedure; the manuscript therefore provides no a-priori reason to expect the 50% threshold to be general rather than dataset-specific.
  3. [Abstract] The manuscript states that HyperLAU 'allows more flexible data' and 'provides new information' compared with existing tools, yet supplies no quantitative comparison (e.g., pathway overlap metrics, false-positive rates on simulated graphs, or sensitivity to the uncertainty representation) that would establish these improvements are not artifacts of the particular TB dataset or the chosen baseline tools.
minor comments (2)
  1. [Abstract] The abstract refers to 'hypercubic inference' models without a brief reminder of the underlying state-space construction or the precise meaning of 'accumulation dynamics'; a one-sentence definition would aid readers unfamiliar with the prior literature.
  2. [Abstract] The claim that the method 'overcome[s] possible biases' from data reduction is stated without specifying how bias is measured or what the baseline bias magnitude is on the TB example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below, acknowledging the limitations in the current version and outlining specific revisions to address them.

Point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the central empirical claim (recovery of main pathways with up to 50% uncertain features on the TB dataset) rests on a single case study without reported controlled simulations that systematically vary the fraction, correlation structure, or type of uncertainty while holding the true accumulation graph fixed. This leaves open whether the hypercubic model structure plus the chosen uncertainty encoding introduces systematic bias under the stated conditions.

    Authors: We agree that the current empirical support relies primarily on the TB case study and that controlled simulations would strengthen the claims. In the revised manuscript we will add a dedicated simulation study section that systematically varies the fraction of uncertain features (including around the 50% level), their correlation structure, and uncertainty types, using fixed ground-truth accumulation graphs. This will allow direct assessment of potential bias introduced by the uncertainty encoding. revision: yes

  2. Referee: [Abstract] Abstract: no analytic bound, parameter-free derivation, or error-propagation analysis is referenced for how uncertainty in individual features propagates through the hypercubic inference procedure; the manuscript therefore provides no a-priori reason to expect the 50% threshold to be general rather than dataset-specific.

    Authors: The manuscript does not currently include analytic bounds or formal error-propagation analysis, as the emphasis was on algorithmic extension and practical application. We will add a section in the revision that analyzes uncertainty propagation through the hypercubic model, including any available bounds or sensitivity results, to provide additional justification for the observed performance levels. revision: yes

  3. Referee: [Abstract] The manuscript states that HyperLAU 'allows more flexible data' and 'provides new information' compared with existing tools, yet supplies no quantitative comparison (e.g., pathway overlap metrics, false-positive rates on simulated graphs, or sensitivity to the uncertainty representation) that would establish these improvements are not artifacts of the particular TB dataset or the chosen baseline tools.

    Authors: We acknowledge the absence of quantitative metrics comparing pathway recovery, false-positive rates, and sensitivity across methods. The revised manuscript will incorporate such comparisons, using both the TB dataset and the new simulations, to quantify improvements in pathway overlap and robustness relative to baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: new algorithm with empirical case-study validation

Full rationale

The manuscript introduces HyperLAU as a new inference algorithm extending hypercubic accumulation models to handle uncertain or missing features. No derivation chain, parameter fitting, or prediction step is shown to reduce tautologically to its own inputs or to a self-citation. Performance claims rest on an empirical demonstration that the algorithm recovers pathways identified by other tools on one TB multidrug-resistance dataset, which constitutes independent validation rather than a self-referential loop. The central contribution is therefore algorithmic and observational, not a closed mathematical identity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no details on parameters, axioms, or new entities are provided.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. An Algebraic Approach to Evolutionary Accumulation Models

    stat.AP 2025-11 unverdicted novelty 6.0

    An algebraic approach defines semi-algebraic parameter sets from underlying polynomial structures in evolutionary processes before likelihood maximization, showing compatibility with existing statistical EvAM models w...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 1 Pith paper

  1. Aga, O. N. L., Brun, M., Dauda, K. A., Diaz-Uriarte, R., Giannakis, K., and Johnston, I. G. (2024). HyperTraPS-CT: Inference and prediction for accumulation pathways with flexible data and model structures. PLOS Computational Biology, 20(9):e1012393.
  2. Beerenwinkel, N., Eriksson, N., and Sturmfels, B. (2006). Evolution on distributive lattices. Journal of Theoretical Biology, 242(2):409–420.
  3. Beerenwinkel, N., Eriksson, N., and Sturmfels, B. (2007). Conjunctive Bayesian networks. Bernoulli, 13(4):893–909.
  4. Beerenwinkel, N., Schwarz, R. F., Gerstung, M., and Markowetz, F. (2015). Cancer Evolution: Mathematical Models and Computational Inference. Systematic Biology, 64(1):e1–e25.
  5. Beerenwinkel, N. and Sullivant, S. (2009). Markov models for accumulating mutations. Biometrika, 96(3):645–661.
  6. Casali, N., Nikolayevskyy, V., Balabanova, Y., Harris, S. R., Ignatyeva, O., Kontsevaya, I., Corander, J., Bryant, J., Parkhill, J., Nejentsev, S., Horstmann, R. D., Brown, T., and Drobniewski, F. (2014). Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nature Genetics, 46(3):279–286.
  7. Chen, J. (2023). Timed hazard networks: Incorporating temporal difference for oncogenetic analysis. PLOS ONE, 18(3):e0283004.
  8. Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems:1695.
  9. Csárdi, G., Nepusz, T., Traag, V., Horvát, S., Zanini, F., Noom, D., and Müller, K. (2024). igraph: Network Analysis and Visualization in R.
  10. Dalgıç, Ö. O., Wu, H., Safa Erenay, F., Sir, M. Y., Özaltın, O. Y., Crum, B. A., and Pasupathy, K. S. (2021). Mapping of critical events in disease progression through binary classification: Application to amyotrophic lateral sclerosis. Journal of Biomedical Informatics, 123:103895.
  11. Desper, R., Jiang, F., Kallioniemi, O.-P., Moch, H., Papadimitriou, C. H., and Schäffer, A. A. (1999). Inferring Tree Models for Oncogenesis from Comparative Genome Hybridization Data. Journal of Computational Biology, 6(1):37–51.
  12. Diaz-Uriarte, R. and Johnston, I. (2025). A picture guide to cancer progression and monotonic accumulation models: evolutionary assumptions, plausible interpretations, and alternative uses. IEEE Access.
  13. Diaz-Uriarte, R. and Herrera-Nieto, P. (2022). EvAM-Tools: tools for evolutionary accumulation and cancer progression models. Bioinformatics, 38(24):5457–5459.
  14. Diaz-Uriarte, R. and Vasallo, C. (2019). Every which way? On predicting tumor evolution using cancer progression models. PLOS Computational Biology, 15(8):e1007246.
  15. Gao, Y., Gaither, J., Chifman, J., and Kubatko, L. (2022). A phylogenetic approach to inferring the order in which mutations arise during cancer progression. PLOS Computational Biology, 18(12):e1010560.
  16. Gerstung, M., Baudis, M., Moch, H., and Beerenwinkel, N. (2009). Quantifying cancer progression with conjunctive Bayesian networks. Bioinformatics, 25(21):2809–2815.
  17. Gotovos, A., Burkholz, R., Quackenbush, J., and Jegelka, S. (2021). Scaling up Continuous-Time Markov Chains Helps Resolve Underspecification. In Advances in Neural Information Processing Systems, volume 34, pages 14580–14592. Curran Associates, Inc.
  18. Greenbury, S. F., Barahona, M., and Johnston, I. G. (2020). HyperTraPS: Inferring Probabilistic Patterns of Trait Acquisition in Evolutionary and Disease Progression Pathways. Cell Systems, 10(1):39–51.e10.
  19. Hjelm, M., Höglund, M., and Lagergren, J. (2006). New Probabilistic Network Models and Algorithms for Oncogenesis. Journal of Computational Biology, 13(4):853–865.
  20. Johnston, I. G. and Diaz-Uriarte, R. (2024). A hypercubic Mk model framework for capturing reversibility in disease, cancer, and evolutionary accumulation modelling. bioRxiv.
  21. Johnston, I. G., Hoffmann, T., Greenbury, S. F., Cominetti, O., Jallow, M., Kwiatkowski, D., Barahona, M., Jones, N. S., and Casals-Pascual, C. (2019). Precision identification of high-risk phenotypes and progression pathways in severe malaria without requiring longitudinal data. npj Digital Medicine, 2(1):1–9.
  22. Johnston, I. G. and Williams, B. P. (2016). Evolutionary Inference across Eukaryotes Identifies Specific Pressures Favoring Mitochondrial Gene Retention. Cell Systems, 2(2):101–111.
  23. Kassambara, A. (2023). ggpubr: 'ggplot2' Based Publication Ready Plots.
  24. Lewis, P. O. (2001). A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Systematic Biology, 50(6):913–925.
  25. Luo, X. G., Kuipers, J., and Beerenwinkel, N. (2023). Joint inference of exclusivity patterns and recurrent trajectories from tumor mutation trees. Nature Communications, 14(1):3676.
  26. Maier, U.-G., Zauner, S., Woehle, C., Bolte, K., Hempel, F., Allen, J. F., and Martin, W. F. (2013). Massively Convergent Evolution for Ribosomal Protein Gene Content in Plastid and Mitochondrial Genomes. Genome Biology and Evolution, 5(12):2318–2329.
  27. Moen, M. T. and Johnston, I. G. (2023). HyperHMM: efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs. Bioinformatics, 39(1):btac803.
  28. Montazeri, H., Kuipers, J., Kouyos, R., Böni, J., Yerly, S., Klimkait, T., Aubert, V., Günthard, H. F., Beerenwinkel, N., and The Swiss HIV Cohort Study (2016). Large-scale inference of conjunctive Bayesian networks. Bioinformatics, 32(17):i727–i735.
  29. Murray, C. J. L., Ikuta, K. S., Sharara, F., Swetschinski, L., Robles Aguilar, G., Gray, A., Han, C., Bisignano, C., Rao, P., Wool, E., Johnson, S. C., Browne, A. J., Chipeta, M. G., Fell, F., Hackett, S., Haines-Woodhouse, G., Kashef Hamadani, B. H., Kumaran, E. A. P., McManigal, B., Achalapong, S., Agarwal, R., Akech, S., Albertson, S., Amuasi, J., Andr...
  30. Nichol, D., Jeavons, P., Fletcher, A. G., Bonomo, R. A., Maini, P. K., Paul, J. L., Gatenby, R. A., Anderson, A. R. A., and Scott, J. G. (2015). Steering Evolution with Sequential Therapy to Prevent the Emergence of Bacterial Antibiotic Resistance. PLOS Computational Biology, 11(9):e1004493.
  31. Nicol, P. B., Coombes, K. R., Deaver, C., Chkrebtii, O., Paul, S., Toland, A. E., and Asiaee, A. (2021). Oncogenetic network estimation with disjunctive Bayesian networks. Computational and Systems Oncology, 1(2):e1027.
  32. Pagel, M. (1994). Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters. Proceedings of the Royal Society, 255(1342):37–45.
  33. Pedersen, T. L. (2024). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks.
  34. Renz, J., Dauda, K. A., Aga, O. N. L., Diaz-Uriarte, R., Löhr, I. H., Blomberg, B., and Johnston, I. G. (2024). Evolutionary accumulation modelling in AMR: machine learning to infer and predict evolutionary dynamics of multi-drug resistance.
  35. Rupp, K., Schill, R., Süskind, J., Georg, P., Klever, M., Lösch, A., Grasedyck, L., Wettig, T., and Spang, R. (2024). Differentiated uniformization: a new method for inferring Markov chains on combinatorial state spaces including stochastic epidemic models. Computational Statistics.
  36. Sanderson, C. and Curtin, R. (2016). Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software, 1(2):26.
  37. Sanderson, C. and Curtin, R. (2019). Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation. Mathematical and Computational Applications, 24(3).
  38. Schill, R., Klever, M., Lösch, A., Hu, Y. L., Vocht, S., Rupp, K., Grasedyck, L., Spang, R., and Beerenwinkel, N. (2024a). Overcoming Observation Bias for Cancer Progression Modeling. In Ma, J., editor, Research in Computational Molecular Biology, pages 217–234, Cham. Springer Nature Switzerland.
  39. Schill, R., Klever, M., Rupp, K., Hu, Y. L., Lösch, A., Georg, P., Pfahler, S., Vocht, S., Hansch, S., Wettig, T., Grasedyck, L., and Spang, R. (2024b). Reconstructing Disease Histories in Huge Discrete State Spaces. KI - Künstliche Intelligenz.
  40. Schill, R., Solbrig, S., Wettig, T., and Spang, R. (2020). Modelling cancer progression using Mutual Hazard Networks. Bioinformatics, 36(1):241–249.
  41. Schwartz, R. and Schäffer, A. A. (2017). The evolution of tumour phylogenetics: principles and practice. Nature Reviews Genetics, 18(4):213–229.
  42. Slowikowski, K. (2024). ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'. https://ggrepel.slowkow.com/, https://github.com/slowkow/ggrepel
  43. Szabo, A. and Boucher, K. (2002). Estimating an oncogenetic tree when false negatives and positives are present. Mathematical Biosciences, 176(2):219–236.
  44. Takeshima, H. and Ushijima, T. (2019). Accumulation of genetic and epigenetic alterations in normal cells and cancer risk. npj Precision Oncology, 3(1):1–8.
  45. Tan, L., Serene, S., Chao, H. X., and Gore, J. (2011). Hidden Randomness between Fitness Landscapes Limits Reverse Evolution. Physical Review Letters, 106(19):198102.
  46. Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
  47. Wickham, H. (2023). stringr: Simple, Consistent Wrappers for Common String Operations.
  48. Wickham, H. and Bryan, J. (2025). readxl: Read Excel Files. R package version 1.4.5, https://github.com/tidyverse/readxl
  49. Wickham, H., François, R., Henry, L., Müller, K., and Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org
  50. Wickham, H., Vaughan, D., and Girlich, M. (2024). tidyr: Tidy Messy Data. R package version 1.3.1, https://github.com/tidyverse/tidyr
  51. Youn, A. and Simon, R. (2012). Estimating the order of mutations during tumorigenesis from tumor genome sequencing data. Bioinformatics, 28(12):1555–1561.