pith. sign in

arxiv: 2511.22736 · v2 · submitted 2025-11-27 · 🧬 q-bio.PE · math.CO

Bounds on the sequence length sufficient to reconstruct binary level-1 phylogenetic networks under the CFN model

Pith reviewed 2026-05-17 04:34 UTC · model grok-4.3

classification 🧬 q-bio.PE math.CO
keywords phylogenetic networksCFN modelsequence length boundslevel-1 networkssemi-directed networksquartet inferencereticulation eventsnetwork reconstruction
0
0 comments X

The pith

Binary level-1 phylogenetic networks can be reconstructed from finite genetic sequences whose length scales logarithmically, polynomially, or polylogarithmically with the number of taxa under the CFN model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that binary level-1 semi-directed phylogenetic networks can be uniquely reconstructed from genetic sequence data of sufficient length under the Cavender-Farris-Neyman model of evolution. It introduces an inference algorithm that achieves this reconstruction with high probability, where the required sequence length depends on the specific parameter regime and grows no faster than polylogarithmically in the number of taxa. This addresses a key gap in understanding data requirements for network inference, as opposed to trees, and provides practical bounds for genomic data collection in evolutionary studies involving reticulation events.

Core claim

We present an inference algorithm that takes as input genetic sequence data and demonstrate that the sequence length sufficient to reconstruct the correct network with high probability, under the CFN model of evolution, scales logarithmically, polynomially, or polylogarithmically with the number of taxa, depending on the parameter regime. Novel inference rules for quartet data in the semi-directed phylogenetic network setting are also provided.

What carries the argument

An inference algorithm that combines novel quartet inference rules for semi-directed networks with statistically consistent reconstruction procedures for binary level-1 networks.

If this is right

  • Correct reconstruction occurs with high probability whenever sequence length satisfies the derived scaling bounds.
  • The method applies directly to any binary level-1 semi-directed network with vertex-disjoint cycles.
  • Reconstruction remains possible even as the number of taxa grows, provided sequence length follows the appropriate regime-dependent bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaling analysis could be extended to other sequence evolution models or to networks that allow intersecting cycles.
  • Experimental designs in genomics could use these bounds to decide how much data is needed before attempting network inference on reticulate histories.
  • The novel quartet rules might serve as building blocks for consistency proofs in broader classes of phylogenetic networks.

Load-bearing premise

The underlying inference methods are statistically consistent and recover the true network from infinite data, and the input networks are binary level-1 semi-directed networks with vertex-disjoint cycles.

What would settle it

A specific binary level-1 network and choice of parameters where the probability of correct reconstruction falls below the high-probability threshold even when the input sequence length meets the claimed polylogarithmic bound.

Figures

Figures reproduced from arXiv: 2511.22736 by Leo van Iersel, Mark Jones, Martin Frohn, Niels Holtgrefe, Steven Kelk.

Figure 1
Figure 1. Figure 1: (a): A rooted phylogenetic network N + on X = {x1 , . . . , x13}. (b): The semi-directed phylogenetic level-1 network N on X obtained from N +. (c): Two unrooted phylogenetic trees T1 and T2 on X displayed by N. (d): Four quartets displayed by N. The quartet x3 x4 |x5 x6 does not form a quartet profile of N, since N also displays another quartet (x3 x6 |x4 x5 , not depicted in the figure) on the same four … view at source ↗
Figure 2
Figure 2. Figure 2: All semi-directed level-1 networks on {a, b,c, d, e} with quartet profiles (abcd) and bc|de (disregarding possible 2- and 3-cycles). The networks in subfigure (a) have a 5-cycle and the networks in subfigure (b) have a 4-cycle. Case 1: N is a sunlet. Since bc|de is a trivial quartet profile, the subnetwork of N induced by {b,c, d, e} also has the split bc|de. Hence, a is the leaf below the reticulation of … view at source ↗
Figure 3
Figure 3. Figure 3: (a): A sunlet network on {x1 , . . . , xn }. The blob B induces the circular order ({x1 }, . . . , {xn }). (b/c): Two types of semi-directed level-1 networks on X1 ∪ . . . ∪ Xs ∪ { y1 } ∪ . . . ∪ { yt }, with s, t ≥ 2, 1 ≤ r ≤ t and 1 ≤ p ≤ s (for subfigure (c)). The yj represent leaves, and the Xi represent subsets of leaves. The networks both have a nontrivial cut-edge uv (with its endpoints in the blobs… view at source ↗
Figure 4
Figure 4. Figure 4: A semi-directed network on {x1 , . . . , x11} used to illustrate the concept of a representative quartet profile. In the remainder of this section we prove that the representative quartet profiles are sufficient to build up the complete set of quartet profiles, solely using dyadic inference rules. We formalize this by means of the following definition and stress that the triadic rule R4∗ is not being appli… view at source ↗
read the original abstract

Phylogenetic trees and networks are graphs used to model evolutionary relationships, with trees representing strictly branching histories and networks allowing for events in which lineages merge, called reticulation events. While the question of data sufficiency has been studied extensively in the context of trees, it remains largely unexplored for networks. In this work we take a first step in this direction by establishing bounds on the amount of genomic data required to reconstruct binary level-$1$ semi-directed phylogenetic networks, which are binary networks in which reticulation events are indicated by directed edges, all other edges are undirected, and cycles are vertex-disjoint. For this class, methods have been developed recently that are statistically consistent. Roughly speaking, such methods are guaranteed to reconstruct the correct network assuming infinitely long genomic sequences. Here we consider the question whether networks from this class can be uniquely and correctly reconstructed from finite sequences. Specifically, we present an inference algorithm that takes as input genetic sequence data, and demonstrate that the sequence length sufficient to reconstruct the correct network with high probability, under the CFN model of evolution, scales logarithmically, polynomially, or polylogarithmically with the number of taxa, depending on the parameter regime. As part of our contribution, we also present novel inference rules for quartet data in the semi-directed phylogenetic network setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents an inference algorithm for binary level-1 semi-directed phylogenetic networks that takes genetic sequence data as input under the CFN model. It establishes explicit bounds showing that the sequence length sufficient for high-probability correct reconstruction scales logarithmically, polynomially, or polylogarithmically with the number of taxa, depending on the parameter regime (specifically, how far the relevant quartet probabilities are from 1/2). Novel inference rules for quartet data in the semi-directed setting are introduced to support the algorithm.

Significance. If the bounds hold, this is a valuable first step toward finite-data guarantees for network reconstruction, extending well-studied tree results to a class of networks with existing consistent methods. The explicit regime-dependent scalings and the combination of quartet consistency with concentration inequalities provide concrete, falsifiable predictions about data requirements that could guide experimental design.

major comments (2)
  1. [Main theorems] Main theorems (around the derivation of the three scaling regimes): the union bound over O(n^4) quartets is applied to obtain the overall error probability, but the per-quartet success probability in the polylogarithmic regime is stated to be 1/polylog(n) away from 1/2; it is not immediately clear from the text whether this separation holds uniformly for quartets that include reticulation edges or only for tree-like quartets.
  2. [Quartet inference rules] Section on novel quartet inference rules: while the rules are defined and claimed to be statistically consistent, the extension of infinite-data consistency to the semi-directed case (accounting for the direction of reticulation edges) is used as a black box in the finite-sample analysis; a short self-contained argument or reference to the precise consistency theorem would make the load-bearing step more transparent.
minor comments (3)
  1. [Abstract] The abstract states the three scaling regimes but does not briefly indicate the corresponding conditions on edge lengths or reticulation parameters; adding one sentence would improve accessibility.
  2. [Preliminaries] Notation for the CFN parameters (e.g., edge lengths and reticulation probabilities) is introduced but could be collected in a single table or preliminary section for easier reference when reading the bounds.
  3. [Discussion] A few sentences in the discussion could explicitly contrast the obtained bounds with known sequence-length requirements for tree reconstruction under the same model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address the major comments point by point below. The revisions we have made aim to enhance the clarity and transparency of our proofs and arguments without altering the main results.

read point-by-point responses
  1. Referee: [Main theorems] Main theorems (around the derivation of the three scaling regimes): the union bound over O(n^4) quartets is applied to obtain the overall error probability, but the per-quartet success probability in the polylogarithmic regime is stated to be 1/polylog(n) away from 1/2; it is not immediately clear from the text whether this separation holds uniformly for quartets that include reticulation edges or only for tree-like quartets.

    Authors: We appreciate the referee's careful attention to the details of our main theorems. The separation from 1/2 in the polylogarithmic regime is indeed uniform and holds for all quartets, including those that incorporate reticulation edges. This uniformity arises because our bounds on the quartet probabilities under the CFN model on binary level-1 networks account for the possible presence of reticulation events through the minimum edge length parameter and the reticulation probability, ensuring a positive separation that scales appropriately with n in the relevant regime. To address the lack of immediate clarity, we have revised the text in the proof of the main theorem to explicitly note that the per-quartet error probability bound applies uniformly across all O(n^4) quartets, with a parenthetical remark confirming its validity for reticulation-involving quartets. This is a partial revision as the underlying mathematics already supported this, but the exposition has been improved. revision: partial

  2. Referee: [Quartet inference rules] Section on novel quartet inference rules: while the rules are defined and claimed to be statistically consistent, the extension of infinite-data consistency to the semi-directed case (accounting for the direction of reticulation edges) is used as a black box in the finite-sample analysis; a short self-contained argument or reference to the precise consistency theorem would make the load-bearing step more transparent.

    Authors: We agree with the referee that greater transparency in the consistency argument would be beneficial. While the rules are statistically consistent by construction in the infinite-data limit, we acknowledge that the extension to the semi-directed setting was referenced rather than fully derived in the finite-sample section. In the revised manuscript, we have added a concise self-contained argument in Section 4 that demonstrates how the novel inference rules recover both the topology and the directions of reticulation edges from the limiting quartet probabilities under the CFN model. This argument builds directly on the probability calculations presented earlier in the paper. Additionally, we have included a reference to the relevant consistency result for semi-directed network reconstruction. This constitutes a full revision to the section in question. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper explicitly constructs novel quartet inference rules for binary level-1 semi-directed networks and derives sequence-length bounds via union bounds over O(n^4) quartets combined with per-quartet concentration probabilities under the CFN model. These probabilities are bounded away from 1/2 in different regimes (constant, 1/poly(n), or 1/polylog(n)) depending on edge lengths and reticulation parameters, yielding the claimed logarithmic, polynomial, or polylogarithmic scalings directly from the probabilistic analysis. The infinite-data consistency of the underlying methods is invoked as a black-box starting point but is not required to be re-derived here; the finite-sample error analysis stands independently without reducing any claimed bound to a fitted parameter, self-definition, or load-bearing self-citation chain. No step equates a prediction to its own input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the CFN model and the level-1 network class; no free parameters or invented entities are visible in the abstract.

axioms (2)
  • domain assumption The CFN model accurately describes sequence evolution for the purpose of the bounds.
    The model is the setting in which the sequence-length bounds are proved.
  • domain assumption Input networks are binary level-1 semi-directed with vertex-disjoint cycles.
    The class for which the algorithm and bounds are stated.

pith-pipeline@v0.9.0 · 5552 in / 1213 out tokens · 60963 ms · 2026-05-17T04:34:36.720199+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

  1. [1]

    The tree of blobs of a species network: identifiability under the coalescent

    E. S. Allman, H. Baños, J. D. Mitchell, and J. A. Rhodes. “The tree of blobs of a species network: identifiability under the coalescent”. In:Journal of Mathematical Biology86.1 (2023), p. 10

  2. [2]

    TINNiK: inference of the tree of blobs of a species network un- der the coalescent model

    E. S. Allman, H. Baños, J. D. Mitchell, and J. A. Rhodes. “TINNiK: inference of the tree of blobs of a species network un- der the coalescent model”. In:Algorithms for Molecular Biology19.1 (2024), p. 23. https://doi.org/10.1186/s13015- 024-00266-2

  3. [3]

    NANUQ: a method for inferring species networks from gene trees under the coalescent model

    E. S. Allman, H. Baños, and J. A. Rhodes. “NANUQ: a method for inferring species networks from gene trees under the coalescent model”. In:Algorithms for Molecular Biology14.1 (2019), p. 24

  4. [4]

    NANUQ +: A divide-and-conquer approach to network estimation

    E. S. Allman, H. Baños, J. A. Rhodes, and K. Wicke. “NANUQ +: A divide-and-conquer approach to network estimation”. In:Algorithms for Molecular Biology20.1 (2025), p. 14. [5]N. Alon and J. H. Spencer.The Probabilistic Method. John Wiley & Sons, New York, 1992

  5. [5]

    Reconstructing the shape of a tree from observed dissimilarity data

    H.-J. Bandelt and A. Dress. “Reconstructing the shape of a tree from observed dissimilarity data”. In:Advances in Applied Mathematics7.3 (1986), pp. 309–343

  6. [6]

    Identifying species network features from gene tree quartets under the coalescent model

    H. Baños. “Identifying species network features from gene tree quartets under the coalescent model”. In:Bulletin of Mathematical Biology81 (2019), pp. 494–534. 20

  7. [7]

    Networks: expanding evolutionary thinking

    E. Bapteste, L. van Iersel, A. Janke, S. Kelchner, S. Kelk, J. O. McInerney, D. A. Morrison, L. Nakhleh, M. Steel, L. Stougie, et al. “Networks: expanding evolutionary thinking”. In:Trends in Genetics29.8 (2013), pp. 439–441. [9]P . Buneman. “A note on the metric properties of trees”. In:J. Combin. Theory Ser. B17.1 (1974), pp. 48–50

  8. [8]

    Tree structures for proximity data

    H. Colonius and H.-H. Schulze. “Tree structures for proximity data”. In:British Journal of Mathematical and Statistical Psychology34.2 (1981), pp. 167–180

  9. [9]

    Evolutionary trees can be learned in polynomial time in the two-state general Markov model

    M. Cryan, L. A. Goldberg, and P . W . Goldberg. “Evolutionary trees can be learned in polynomial time in the two-state general Markov model”. In:SIAM Journal on Computing31.2 (2001), pp. 375–397. [12]M. Cs ˝urös and M.-Y. Kao. “Recovering evolutionary trees through harmonic greedy triplets”. In:Proceedings of the Tenth Annual ACM-SIAM Symposium on Discret...

  10. [10]

    Identifiability of Phylogenetic Level-2 Networks under the Jukes-Cantor Model

    A. K. Englander, M. Frohn, E. Gross, N. Holtgrefe, L. van Iersel, M. Jones, and S. Sullivant. “Identifiability of Phylogenetic Level-2 Networks under the Jukes-Cantor Model”. In:bioRxiv(2025), pp. 2025–04

  11. [11]

    A few logs suffice to build (almost) all trees (I)

    P . L. Erd˝os, M. A. Steel, L. A. Székely, and T . J. Warnow. “A few logs suffice to build (almost) all trees (I)”. In: Random Structures & Algorithms14.2 (1999), pp. 153–184

  12. [12]

    A few logs suffice to build (almost) all trees: Part II

    P . L. Erd˝os, M. A. Steel, L. A. Székely, and T . J. Warnow. “A few logs suffice to build (almost) all trees: Part II”. In: Theoretical Computer Science221.1-2 (1999), pp. 77–118

  13. [13]

    Reconstructing semi-directed level-1 networks using few quarnets

    M. Frohn, N. Holtgrefe, L. van Iersel, M. Jones, and S. Kelk. “Reconstructing semi-directed level-1 networks using few quarnets”. In:Journal of Computer and System Sciences152 (2025), p. 103655

  14. [14]

    Distinguishing phylogenetic networks

    E. Gross and C. Long. “Distinguishing phylogenetic networks”. In:SIAM Journal on Applied Algebra and Geometry 2.1 (2018), pp. 72–93

  15. [15]

    Lower Bounds on the Sample Complexity of Species Tree Estimation when Substitution Rates Vary Across Loci

    M. Hill and S. Roch. “Lower Bounds on the Sample Complexity of Species Tree Estimation when Substitution Rates Vary Across Loci”. In:Bulletin of Mathematical Biology87.152 (2025)

  16. [16]

    Squirrel: Reconstructing semi-directed phylogenetic level-1 networks from four-leaved networks or sequence alignments

    N. Holtgrefe, K. T . Huber, L. van Iersel, M. Jones, S. Martin, and V . Moulton. “Squirrel: Reconstructing semi-directed phylogenetic level-1 networks from four-leaved networks or sequence alignments”. In:Molecular Biology and Evolution42.4 (2025), msaf067. https://doi.org/10.1093/molbev/msaf067

  17. [17]

    Quarnet inference rules for level-1 networks

    K. T . Huber, V . Moulton, C. Semple, and T . Wu. “Quarnet inference rules for level-1 networks”. In:Bulletin of Mathematical Biology80 (2018), pp. 2137–2153

  18. [18]

    Disk-covering, a fast-converging method for phylogenetic tree reconstruction

    D. H. Huson, S. M. Nettles, and T . J. Warnow. “Disk-covering, a fast-converging method for phylogenetic tree reconstruction”. In:Journal of Computational Biology6.3-4 (1999), pp. 369–386

  19. [19]

    Reconstructing a phylogenetic level-1 network from quartets

    J. Keijsper and R. Pendavingh. “Reconstructing a phylogenetic level-1 network from quartets”. In:Bulletin of Mathematical Biology76.10 (2014), pp. 2517–2541

  20. [20]

    On the complexity of distance-based evolutionary tree

    V . King, L. Zhang, and Y. Zhou. “On the complexity of distance-based evolutionary tree”. In:Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. 2003, p. 444

  21. [21]

    Classes of explicit phylogenetic networks and their biological and mathematical significance

    S. Kong, J. C. Pons, L. Kubatko, and K. Wicke. “Classes of explicit phylogenetic networks and their biological and mathematical significance”. In:Journal of Mathematical Biology84.6 (2022), p. 47

  22. [22]

    Inference of phylogenetic networks from sequence data using composite likelihood

    S. Kong, D. L. Swofford, and L. S. Kubatko. “Inference of phylogenetic networks from sequence data using composite likelihood”. In:Systematic Biology74.1 (2025), pp. 53–69

  23. [23]

    Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances

    J. A. Lake. “Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances.” In:Proceedings of the National Academy of Sciences91.4 (1994), pp. 1455–1459

  24. [24]

    Algebraic invariants for inferring 4-leaf semi-directed phylogenetic networks

    S. Martin, N. Holtgrefe, V . Moulton, and R. M. Leggett. “Algebraic invariants for inferring 4-leaf semi-directed phylogenetic networks”. In:Systematic Biology(2025), syaf071

  25. [25]

    On the impossibility of reconstructing ancestral data and phylogenies

    E. Mossel. “On the impossibility of reconstructing ancestral data and phylogenies”. In:Journal of Computational Biology10.5 (2003), pp. 669–676

  26. [26]

    Phase transitions in phylogeny

    E. Mossel. “Phase transitions in phylogeny”. In:Transactions of the American Mathematical Society356.6 (2004), pp. 2379–2404

  27. [27]

    Molecular studies of evolution: a source of novel statistical problems

    J. Neyman. “Molecular studies of evolution: a source of novel statistical problems”. In:Statistical decision theory and related topics. Elsevier, 1971, pp. 1–27

  28. [28]

    A note on the tree realizability of a distance matrix

    J. S. Pereira. “A note on the tree realizability of a distance matrix”. In:Journal of Combinatorial Theory6.3 (1969), pp. 303–310

  29. [29]

    Identifying circular orders for blobs in phylogenetic networks

    J. A. Rhodes, H. Baños, J. Xu, and C. Ané. “Identifying circular orders for blobs in phylogenetic networks”. In: Advances in Applied Mathematics163 (2025), p. 102804. [33]C. Semple and M. Steel.Phylogenetics. Vol. 24. Oxford University Press on Demand, 2003

  30. [30]

    Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting

    C. Solís-Lemus and C. Ané. “Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting”. In:PLoS genetics12.3 (2016), e1005896. https://doi.org/10.1371/journal.pgen.1005896

  31. [31]

    Recovering a tree from the leaf colourations it generates under a Markov model

    M. Steel. “Recovering a tree from the leaf colourations it generates under a Markov model”. In:Applied Mathematics Letters7.2 (1994), pp. 19–23

  32. [32]

    Absolute convergence: true trees from short sequences

    T . Warnow, B. M. Moret, and K. St. John. “Absolute convergence: true trees from short sequences”. In:Symposium on Discrete Algorithms: Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms. Vol. 7. 09. 2001, pp. 186–195

  33. [33]

    Advances in estimating level-1 phylogenetic networks from unrooted SNPs

    T . Warnow, Y. Tabatabaee, and S. N. Evans. “Advances in estimating level-1 phylogenetic networks from unrooted SNPs”. In:Journal of Computational Biology32.1 (2025), pp. 3–27

  34. [34]

    Identifiability of local and global features of phylogenetic networks from average distances

    J. Xu and C. Ané. “Identifiability of local and global features of phylogenetic networks from average distances”. In: Journal of Mathematical Biology86.1 (2023), p. 12. 21

  35. [35]

    New absolute fast converging phylogeny estimation methods with improved scalability and accuracy

    Q. R. Zhang, S. Rao, and T . Warnow. “New absolute fast converging phylogeny estimation methods with improved scalability and accuracy”. In:18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Schloss Dagstuhl–Leibniz-Zentrum für Informatik. 2018, pp. 8–1. 22