pith. sign in

arxiv: 2605.21725 · v1 · pith:IA2NYU6Unew · submitted 2026-05-20 · 🧬 q-bio.PE · math.CO

Regularizing and Normalizing DAGs and Phylogenetic Networks

Pith reviewed 2026-05-22 07:31 UTC · model grok-4.3

classification 🧬 q-bio.PE math.CO
keywords phylogenetic networksdirected acyclic graphsleast common ancestorsregularizationnormalizationclustersvisibilityHasse diagram
0
0 comments X

The pith

i-regularization retains exactly the unique LCAs of at most i leaves in a DAG, removes the rest via the ⊖ operator, deletes shortcuts, and yields a network isomorphic to the Hasse diagram of the induced lca-clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Phylogenetic networks and DAGs encode evolutionary histories that may include reticulations such as hybridization. A central task is to identify which vertices and edges carry essential leaf-observable structure and which can be deleted without loss. The paper defines i-regularization as the process that keeps only vertices serving as unique least common ancestors of leaf subsets of size at most i, applies a graph-editing removal step, and then eliminates shortcuts. It proves that the resulting network still displays exactly the same LCAs, satisfies an i-lca-relevance property, and admits a direct description as the Hasse diagram of the corresponding clusters. The same editing operator is then used to compare this regularization with existing normalization procedures and to characterize when the two coincide.

Core claim

For any DAG G and integer i ≥ 1, the regularized DAG reg_i(G) retains precisely the vertices that occur as unique LCAs of leaf subsets of size at most i, removes all other non-leaf vertices by the ⊖ operation, and deletes shortcuts; the resulting network preserves every such LCA, is i-lca-relevant, and is isomorphic to the Hasse diagram of the lca-clusters.

What carries the argument

The i-regularization operator that selects unique LCAs of small leaf sets, applies the ⊖ removal, and deletes shortcuts to produce a regular DAG.

If this is right

  • The simplified network admits a purely cluster-based description without reference to internal vertices.
  • Regularization and normalization coincide precisely when the visible vertices removed by normalization are exactly the non-unique LCAs.
  • A single editing framework unifies cluster-based, LCA-based, and visibility-based simplifications of DAGs and phylogenetic networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization could be applied iteratively with increasing i to produce a sequence of successively coarser networks.
  • Efficient computation of reg_i(G) would require only the enumeration of small LCAs rather than full transitive closure.
  • The construction suggests a way to compare simplification methods across different data types that admit LCA or cluster interpretations.

Load-bearing premise

The ⊖ editing step removes non-LCA vertices without changing which LCAs of small leaf sets remain observable from the leaves.

What would settle it

A concrete DAG in which applying reg_i changes the set of LCAs for some leaf subset of size at most i, or produces a network whose transitive closure of ancestor relations differs from that of the Hasse diagram of its lca-clusters.

Figures

Figures reproduced from arXiv: 2605.21725 by Anna Lindeberg, Marc Hellmuth, Vincent Moulton.

Figure 1
Figure 1. Figure 1: Two networks N1, N2 and two trees T1, T2, for which T1 = REG(N1) = NORM(N1) and T1 = REG(N2) ̸= NORM(N2) = T2, see Section 3 and 4 for more details. Three classical viewpoints on this question are central to this paper. The first is cluster in￾formation: every vertex v in a DAG G with leaf set X induces a cluster CG(v) = {x ∈ X | there is a directed path from v to x in G}, and the resulting set system C(G)… view at source ↗
Figure 2
Figure 2. Figure 2: Two DAGs that illustrate certain concepts related to least common ancestors; see the text for further details. In general, not every set A ⊆ X has a (unique) least common ancestor in a DAG; for example, consider the DAG G in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A network N on X = {a,b,c,d,e, f,g}. Numbers i next to a vertex v denote the smallest integers such that v ∈ lcai(N). If i = 0, then the respective vertex is not a k-lca vertex for any k. In addition several networks REGi(N) are shown. Note that REG1(N) = (X, /0) just consists of X and no arcs. Definition 3.6. Let G be a DAG on X. The set of lca-clusters of G is defined to be Clca(G) := {C ∈ C(G) | lcaG(C)… view at source ↗
Figure 4
Figure 4. Figure 4: A DAG G for which H(C(G)) ̸≃ REG(G) holds [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Two networks N1 and N2 for which the regularization and normalization coincide, i.e., REG(N1) = N ′ 1 = NORM(N1) and REG(N2) = N ′ 2 = NORM(N2). Note that H(C(G)) and REG(G) may differ; see [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: A network N on {a,b,c,d}. The table specifies which of the vertices of N are (non-)visible or (non-)k-lca vertices. Proof. Since the proof is straightforward, we have placed it in Appendix A. We now provide some characterizations of strongly normal networks. Proposition 4.5. Let N be a network. Then, the following three statements are equivalent. (1) N is strongly normal. (2) N is normal and 2-lca-relevant… view at source ↗
Figure 7
Figure 7. Figure 7: The network N has a unique root, while the corresponding DAG REG(N) has two. By Lemma 4.6, G ⊖U is strongly normal and so Lemma 4.4 implies that G ⊖U is 2-lca-relevant. Hence, there are leaves x,y ∈ X such that v = lcaG⊖U (x,y). In other words, LCAG⊖U ({x,y}) = {v}. This together with Lemma 2.11 implies that LCAG({x,y}) = {v} and, therefore, v = lcaG(x,y). By definition, v ∈ lca2(G) =V \lca2(G). In conclus… view at source ↗
Figure 8
Figure 8. Figure 8: Two networks N1, N2 and two trees T1, T2 on {a,b,c}, for which T1 = REG(N1) = NORM(N2) and T2 = NORM(N1) = REG(N2). different vertices as “omittable”. A technical difficulty in comparing regularization and normaliza￾tion is that the operator REG is invariant under the addition or removal of shortcuts. More precisely, REG(N −) = REG(N) always holds; cf. Lemma 3.10(2). In contrast, the normalization operator… view at source ↗
Figure 9
Figure 9. Figure 9: Some networks resulting from the N on the very left after application of the ⊖-operator and removal of shortcuts. In particular, note that (N ⊖u) − ⊖w is not isomorphic to (N ⊖ {u,w}) −. that w ̸= c. By definition of Cvis(v), it holds that v ̸= w and there is an vw-path in N and thus, w ≺N v and, furthermore, w ∈ vis(N) = V(COV(N)) which together with Eq. (1) implies that w ≺COV(N) v. Since c is the unique… view at source ↗
read the original abstract

Phylogenetic networks and, more generally, directed acyclic graphs (DAGs) represent hierarchical structure beyond trees, for instance in the presence of reticulate evolutionary events such as hybridization or horizontal gene transfer. A central question is which parts of such graphs are essential with respect to leaf-observable information, and which parts can be removed without changing this information. Resolving this question can lead to principled simplification methods for phylogenetic networks, such as the recent normalization approach of Francis et al. In this paper, we study this question from three related perspectives: clusters displayed by a DAG $G$, least common ancestors (LCAs) of subsets of its leaf set, and visibility, a path-based property of vertices. We first introduce an LCA-based simplification procedure called $i$-regularization. For a DAG $G$ and $i\geq 1$, the DAG $\reg_i(G)$ retains precisely those vertices that occur as unique LCAs of leaf subsets of size at most $i$, removes the remaining non-leaf vertices by a graph-editing operation $\ominus$, and then deletes shortcuts. We show that $\reg_i(G)$ preserves all such LCAs, is $i$-lca-relevant, and admits a cluster-level description: it is regular, i.e., isomorphic to the Hasse diagram of the corresponding lca-clusters. We then compare LCA-based regularization with normalization. Using the same $\ominus$-operator, we describe the cover construction underlying normalization, identify visible vertices that are nevertheless removed, and characterize when regularization and normalization coincide. Together, these results provide a unified framework for cluster-based, LCA-based, and visibility-based simplifications of DAGs and phylogenetic networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces an i-regularization operator reg_i for DAGs and phylogenetic networks. For given i ≥ 1, reg_i(G) retains exactly the vertices that serve as unique LCAs of leaf subsets of size at most i, removes all other non-leaf vertices via the editing operation ⊖, and then deletes shortcuts. The central theorems assert that reg_i(G) preserves all such LCAs, is i-lca-relevant, and is regular (isomorphic to the Hasse diagram of the corresponding lca-clusters). The paper further compares this construction to the normalization procedure of Francis et al., using the same ⊖ operator to characterize visible vertices removed by normalization and to identify conditions under which the two simplifications coincide.

Significance. If the preservation and regularity claims hold, the work supplies a unified, cluster/LCA/visibility-based framework for simplifying phylogenetic networks while retaining leaf-observable information. This extends existing normalization methods and could support more principled reduction of reticulate networks in evolutionary biology.

major comments (1)
  1. [regularization procedure / definition of reg_i and ⊖] Definition of the ⊖ editing operation and the subsequent shortcut-deletion step (abstract and the regularization procedure section): the load-bearing claim that removing non-unique-LCA vertices via ⊖ followed by shortcut deletion leaves the set of LCAs for all leaf subsets of size ≤ i unchanged is not accompanied by an explicit argument or counter-example check for DAGs containing reticulations or multiple paths between leaves. This step is invoked both in the definition of reg_i and in the stated preservation theorem; without a self-contained verification, the central simplification result remains at risk.
minor comments (2)
  1. [Abstract and introduction] The term 'i-lca-relevant' is used in the abstract and theorems without a one-sentence gloss; adding a brief parenthetical definition on first use would improve readability for readers outside the immediate subfield.
  2. [Throughout the manuscript] Notation for the editing operation alternates between ⊖ and ominus in the text; adopt a single symbol throughout and ensure it is defined before its first use in the regularization construction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comment below and will revise the paper to strengthen the presentation of the regularization procedure and its proofs.

read point-by-point responses
  1. Referee: Definition of the ⊖ editing operation and the subsequent shortcut-deletion step (abstract and the regularization procedure section): the load-bearing claim that removing non-unique-LCA vertices via ⊖ followed by shortcut deletion leaves the set of LCAs for all leaf subsets of size ≤ i unchanged is not accompanied by an explicit argument or counter-example check for DAGs containing reticulations or multiple paths between leaves. This step is invoked both in the definition of reg_i and in the stated preservation theorem; without a self-contained verification, the central simplification result remains at risk.

    Authors: We appreciate the referee highlighting the need for greater explicitness here. Theorem 3.2 establishes LCA preservation by showing that reg_i(G) retains precisely the unique LCAs of leaf sets of size at most i, that ⊖ removes only non-unique-LCA vertices without altering the LCAs of retained leaf sets, and that shortcut deletion neither creates nor destroys these LCAs. To make this more self-contained for reticulate DAGs and cases with multiple paths, we will add a new lemma (with proof) that walks through the effect of each step on paths and LCAs, together with two explicit examples: one reticulate network with multiple paths between leaves and one where a non-unique LCA is removed. This revision will directly address the concern while preserving the existing theorems. revision: yes

Circularity Check

0 steps flagged

No circularity: claims follow from explicit definitions and graph operations

full rationale

The paper introduces the reg_i operator and ⊖ editing step as new definitions that retain vertices which are unique LCAs of leaf subsets of size ≤i, then deletes shortcuts. It states theorems that the resulting DAG preserves those LCAs, is i-lca-relevant, and is regular (isomorphic to the Hasse diagram of lca-clusters). These are derived consequences of the definitions using standard DAG operations rather than any quantity defined in terms of the output itself. No fitted parameters, self-citations, or imported uniqueness theorems appear in the load-bearing steps. The derivation is therefore self-contained against the paper's own stated assumptions and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The paper relies on standard directed-graph axioms and the definition of least common ancestors; the new entities are the regularization operator and the ⊖ editing step, both introduced without external empirical support.

axioms (2)
  • standard math A DAG admits a well-defined least common ancestor for every non-empty subset of its leaves.
    Invoked when the authors define the vertices retained by reg_i(G) as unique LCAs of leaf subsets.
  • domain assumption The ⊖ operator removes a vertex while preserving all paths between remaining vertices.
    Used in the definition of regularization and in the cover construction for normalization.
invented entities (2)
  • i-regularization operator reg_i no independent evidence
    purpose: To produce a simplified DAG that retains exactly the unique LCAs of leaf subsets of size at most i.
    Newly defined simplification procedure; no independent falsifiable prediction outside the paper is supplied.
  • ⊖ editing operation no independent evidence
    purpose: To remove non-retained vertices while maintaining the ancestor relations among the kept vertices.
    Introduced as the common editing step for both regularization and normalization.

pith-pipeline@v0.9.0 · 5841 in / 1615 out tokens · 54583 ms · 2026-05-22T07:31:08.692180+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [2]

    IEEE Transactions on Knowledge and Data Engineering 16(9):1128–1142, DOI 10.1109/TKDE.2004

    van der Aalst WMP (2004) Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering 16(9):1128–1142, DOI 10.1109/TKDE.2004. 47

  2. [3]

    SIAM Journal on Com- puting 10(3):405–421, DOI 10.1137/0210030

    Aho A V , Sagiv Y , Szymanski TG, Ullman JD (1981) Inferring a tree from lowest common an- cestors with an application to the optimization of relational expressions. SIAM Journal on Com- puting 10(3):405–421, DOI 10.1137/0210030

  3. [4]

    Annals of Combinatorics 10(1):19–30, DOI 10.1007/s00026-006-0271-0

    Baroni M, Steel M (2006) Accumulation phylogenies. Annals of Combinatorics 10(1):19–30, DOI 10.1007/s00026-006-0271-0

  4. [5]

    Ann Comb 8:391–408, DOI 10.1007/s00026-004-0228-0

    Baroni M, Semple C, Steel M (2005) A framework for representing reticulate evolution. Ann Comb 8:391–408, DOI 10.1007/s00026-004-0228-0

  5. [6]

    Systematic Biology 55(1):46–56, DOI 10.1080/10635150500431197

    Baroni M, Semple C, Steel M (2006) Hybrids in real time. Systematic Biology 55(1):46–56, DOI 10.1080/10635150500431197

  6. [7]

    IEEE/ACM Trans Comp Biol Bioinf 6:552–569, DOI 10.1109/TCBB.2007.70270

    Cardona G, Rossell ´o F, Valiente G (2009) Comparison of tree-child phylogenetic networks. IEEE/ACM Trans Comp Biol Bioinf 6:552–569, DOI 10.1109/TCBB.2007.70270

  7. [8]

    Theoret- ical Computer Science 796:129–146, DOI 10.1016/j.tcs.2019.09.003

    D ¨ocker J, Linz S, Semple C (2019) Displaying trees across two phylogenetic networks. Theoret- ical Computer Science 796:129–146, DOI 10.1016/j.tcs.2019.09.003

  8. [9]

    Journal of Theoretical Biology 265(4):535–542, DOI 10.1016/j.jtbi.2010

    Dress A, Moulton V , Steel M, Wu T (2010) Species, clusters and the ’tree of life’: A graph- theoretic perspective. Journal of Theoretical Biology 265(4):535–542, DOI 10.1016/j.jtbi.2010. 05.031

  9. [10]

    Journal of Theoretical Biology 614:112236, DOI 10.1016/j.jtbi.2025.112236

    Francis A (2025) ”Normal” phylogenetic networks may be emerging as the leading class. Journal of Theoretical Biology 614:112236, DOI 10.1016/j.jtbi.2025.112236

  10. [11]

    Molecular Phyloge- netics and Evolution 163:107215, DOI 10.1016/j.ympev.2021.107215

    Francis A, Huson DH, Steel M (2021) Normalising phylogenetic networks. Molecular Phyloge- netics and Evolution 163:107215, DOI 10.1016/j.ympev.2021.107215

  11. [12]

    Journal of Mathematical Biology 80(3):865–953, DOI 10.1007/s00285-019-01444-2

    Geiß M, Stadler PF, Hellmuth M (2020) Reciprocal best match graphs. Journal of Mathematical Biology 80(3):865–953, DOI 10.1007/s00285-019-01444-2

  12. [13]

    The MIT Press, DOI 10.7551/mitpress/9432.001.0001

    Gusfield D (2014) ReCombinatorics: the algorithmics of ancestral recombination graphs and explicit phylogenetic networks. The MIT Press, DOI 10.7551/mitpress/9432.001.0001

  13. [14]

    Bulletin of Mathematical Biology 87(2):20, DOI 10.1007/s11538-024-01398-7

    Heiss J, Huson DH, Steel M (2025) Transformations to simplify phylogenetic networks. Bulletin of Mathematical Biology 87(2):20, DOI 10.1007/s11538-024-01398-7

  14. [15]

    Discrete Applied Mathematics 378:584–593, DOI 10.1016/j.dam.2025.08.037

    Hellmuth M, Lindeberg A (2026) Characterizing and transforming DAGs within the I-lca frame- work. Discrete Applied Mathematics 378:584–593, DOI 10.1016/j.dam.2025.08.037

  15. [16]

    Theory in Biosciences 142(4):301–358, DOI 10.1007/s12064-023-00398-w

    Hellmuth M, Schaller D, Stadler PF (2023) Clustering systems of phylogenetic networks. Theory in Biosciences 142(4):301–358, DOI 10.1007/s12064-023-00398-w

  16. [17]

    Journal of Classification 36(2):200–231, DOI 10.1007/s00357-018-9279-5

    Huber KT, Moulton V , Wu T (2019) Hierarchies from lowest stable ancestors in nonbinary phy- logenetic networks. Journal of Classification 36(2):200–231, DOI 10.1007/s00357-018-9279-5

  17. [18]

    Genome Biol Evol 3:23–35, DOI 10.1093/gbe/evq077

    Huson DH, Scornavacca C (2011) A survey of combinatorial methods for phylogenetic networks. Genome Biol Evol 3:23–35, DOI 10.1093/gbe/evq077

  18. [19]

    MIT Press

    Koller D, Friedman N (2009) Probabilistic Graphical Models: Principles and Techniques. MIT Press

  19. [20]

    Bull Math Biol 87(3):44, DOI 10.1007/ s11538-025-01419-z 23

    Lindeberg A, Hellmuth M (2025) Simplifying and characterizing DAGs and phylogenetic networks via least common ancestor constraints. Bull Math Biol 87(3):44, DOI 10.1007/ s11538-025-01419-z 23

  20. [21]

    Journal of Mathematical Biology 81(4):961–980, DOI 10.1007/ s00285-020-01533-7

    Linz S, Semple C (2020) Caterpillars on three and four leaves are sufficient to reconstruct binary normal networks. Journal of Mathematical Biology 81(4):961–980, DOI 10.1007/ s00285-020-01533-7

  21. [22]

    IEEE/ACM Transactions on Com- putational Biology and Bioinformatics 1(1):13–23, DOI 10.1109/TCBB.2004.10

    Moret B, Nakhleh L, Warnow T, Linder C, Tholse A, Padolina A, Sun J, Timme R (2004) Phylo- genetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Com- putational Biology and Bioinformatics 1(1):13–23, DOI 10.1109/TCBB.2004.10

  22. [23]

    Cambridge University Press, DOI 10.1142/S0218126698000043

    Pearl J (2009) Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge University Press, DOI 10.1142/S0218126698000043

  23. [24]

    Oxford University Press, Oxford, UK

    Semple C, Steel M (2003) Phylogenetics, Oxford Lecture Series in Mathematics and its Appli- cations, vol 24. Oxford University Press, Oxford, UK

  24. [25]

    In: Kalyanasundaram S, Maheshwari A (eds) Algorithms and Discrete Applied Mathematics, Springer Nature Switzerland, Cham, pp 148–161, DOI 10.1007/ 978-3-031-52213-0 11

    Shanavas A V , Changat M, Hellmuth M, Stadler PF (2024) Unique least common ancestors and clusters in directed acyclic graphs. In: Kalyanasundaram S, Maheshwari A (eds) Algorithms and Discrete Applied Mathematics, Springer Nature Switzerland, Cham, pp 148–161, DOI 10.1007/ 978-3-031-52213-0 11

  25. [26]

    Algorithms for Molecular Biology 15(1):5, DOI 10.1186/s13015-020-00165-2

    Stadler PF, Geiß M, Schaller D, L ´opez S´anchez A, Gonz´alez Laffitte M, Valdivia DI, Hellmuth M, Hern ´andez Rosales M (2020) From pairs of most similar sequences to phylogenetic best matches. Algorithms for Molecular Biology 15(1):5, DOI 10.1186/s13015-020-00165-2

  26. [27]

    IEEE/ACM Transac- tions on Computational Biology and Bioinformatics 9(4):1128–1138, DOI 10.1109/TCBB.2012

    Willson S (2012) CSD Homomorphisms between Phylogenetic Networks. IEEE/ACM Transac- tions on Computational Biology and Bioinformatics 9(4):1128–1138, DOI 10.1109/TCBB.2012. 52

  27. [28]

    Bulletin of Mathematical Biology 72(2):340–358, DOI 10.1007/s11538-009-9449-z

    Willson SJ (2010) Properties of normal phylogenetic networks. Bulletin of Mathematical Biology 72(2):340–358, DOI 10.1007/s11538-009-9449-z

  28. [29]

    Bulletin of Mathe- matical Biology 73(10):2322–2338, DOI 10.1007/s11538-010-9624-2

    Willson SJ (2011) Restricted trees: Simplifying networks with bottlenecks. Bulletin of Mathe- matical Biology 73(10):2322–2338, DOI 10.1007/s11538-010-9624-2

  29. [30]

    Annals of Combinatorics 20(4):917–938, DOI 10.1007/s00026-016-0324-y

    Willson SJ (2016) Comparing and simplifying distinct-cluster phylogenetic networks. Annals of Combinatorics 20(4):917–938, DOI 10.1007/s00026-016-0324-y

  30. [31]

    Bulletin of Mathematical Biology 84(2):26, DOI 10.1007/s11538-021-00986-1 24

    Willson SJ (2022) Merging arcs to produce acyclic phylogenetic networks and normal networks. Bulletin of Mathematical Biology 84(2):26, DOI 10.1007/s11538-021-00986-1 24