
arxiv: 2604.09509 · v1 · submitted 2026-04-10 · 🧮 math.PR · q-bio.PE

An Improved Bipartition Cover Bound for the Multispecies Coalescent Model

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

classification 🧮 math.PR q-bio.PE
keywords bipartition cover · multispecies coalescent · gene trees · species tree · upper bounds · Kingman's coalescent · ASTRAL · coalescence times

The pith

New upper bounds show that fewer loci are needed to cover every species tree bipartition with high probability under the multispecies coalescent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives topology-free upper bounds on the number of independent gene trees required so that every bipartition of the true species tree appears at least once with prescribed probability. These bounds improve on earlier results and remain below typical empirical numbers of loci over a wider range of divergence times and population sizes. The improvement comes from new asymptotic analysis of coalescence waiting times when internal branches are short. A sympathetic reader would care because the finite-sample guarantees for summary methods such as ASTRAL rest on the gene-tree collection containing every bipartition of the species tree.

Core claim

Under the multispecies coalescent, the probability that a random gene tree realizes a given bipartition is controlled by the waiting times for lineage coalescence within the species tree branches. By developing fresh asymptotics for these waiting times in the short-branch regime and combining them with concentration inequalities, the authors obtain explicit upper bounds on the number of loci that are smaller than previous topology-free bounds and that do not depend on the particular species-tree shape.
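The step from a per-locus probability to a locus count can be illustrated with a textbook union-bound calculation. This is a minimal sketch and not the paper's bound: the function name loci_for_cover and the parameter p_min (an assumed lower bound on the probability that a single gene tree displays any fixed species-tree bipartition) are hypothetical. It uses only the facts that an unrooted binary tree on n taxa has n - 3 nontrivial bipartitions and that loci are independent.

```python
import math


def loci_for_cover(n_taxa: int, p_min: float, eps: float) -> int:
    """Smallest m with (n_taxa - 3) * (1 - p_min)**m <= eps.

    Union bound: each of the n_taxa - 3 nontrivial bipartitions is missed by
    all m independent loci with probability at most (1 - p_min)**m.
    p_min is an assumed input, not a quantity taken from the paper.
    """
    if n_taxa < 4:
        return 0  # no nontrivial bipartitions to cover
    if not (0.0 < p_min <= 1.0) or not (0.0 < eps < 1.0):
        raise ValueError("need 0 < p_min <= 1 and 0 < eps < 1")
    if p_min == 1.0:
        return 1  # a single locus always displays every bipartition
    k = n_taxa - 3
    # Solve k * (1 - p_min)**m <= eps for m; log1p keeps precision for small p_min.
    m = math.ceil(math.log(k / eps) / -math.log1p(-p_min))
    return max(m, 1)
```

For example, loci_for_cover(10, 0.3, 0.05) asks how many loci suffice for 10 taxa when each bipartition appears in a gene tree with probability at least 0.3 and the allowed failure probability is 0.05. The paper's contribution lies in controlling the per-locus probability itself, which this sketch simply takes as given.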

What carries the argument

Bipartition cover probability under the multispecies coalescent, sharpened by new short-branch asymptotics for absorption times in Kingman's coalescent.
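For readers unfamiliar with absorption times, the following sketch simulates the time for Kingman's coalescent started from n lineages to reach a single lineage. It assumes standard coalescent units, in which k lineages wait an exponential time with rate k(k-1)/2 before the next merger; the paper's time scaling may differ, and the function names are hypothetical. It illustrates the random variable only and does not reproduce the paper's asymptotics.

```python
import random


def absorption_time(n: int, rng: random.Random) -> float:
    """One draw of the time for n lineages to coalesce to a single lineage.

    While k lineages remain, the waiting time to the next merger is
    exponential with rate k*(k-1)/2 (coalescent units, an assumed convention).
    """
    return sum(rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))


def mean_absorption_time(n: int) -> float:
    """Exact mean: sum over k of 2/(k*(k-1)), which telescopes to 2*(1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)


def prob_absorbed_by(n: int, t: float, reps: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(absorption time <= t)."""
    rng = random.Random(seed)
    hits = sum(absorption_time(n, rng) <= t for _ in range(reps))
    return hits / reps
```

Evaluating prob_absorbed_by for small t shows how quickly the probability of full coalescence within a branch decays as the branch shortens, which is the regime the new asymptotics address.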

If this is right

  • Methods that rely on bipartition coverage, such as ASTRAL, can use smaller sample sizes while retaining finite-sample guarantees.
  • The bounds apply uniformly to any species-tree topology, removing the need to know the tree shape in advance.
  • New coalescence asymptotics also give sharper control on the distribution of gene-tree topologies when internal branches are short.
  • Simulations across varied topologies confirm that the analytic bounds remain conservative yet tighter than earlier work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These bounds could be used to decide in advance how many loci to sequence for a given expected branch length and population size.
  • The short-branch analysis may extend to other coalescent models whose waiting-time distributions admit similar expansions.
  • If the bounds are close to tight, they offer a practical check on whether an existing multi-locus dataset already satisfies the coverage condition for reliable species-tree inference.
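The coverage condition itself is straightforward to check on an existing dataset. The sketch below is illustrative only: it assumes trees are given as nested tuples of taxon-name strings, that every gene tree is on the same taxon set as the species tree, and the function names are hypothetical. A bipartition is stored as an unordered pair of taxon sets, and trivial splits (one side with fewer than two taxa) are ignored.

```python
def clades(tree):
    """Return (leaf set, leaf sets below every internal node) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def bipartitions(tree):
    """Nontrivial bipartitions of the unrooted version of the tree."""
    taxa, found = clades(tree)
    bips = set()
    for side in found:
        other = taxa - side
        if len(side) >= 2 and len(other) >= 2:
            bips.add(frozenset([side, other]))
    return bips


def is_bipartition_cover(species_tree, gene_trees):
    """True if every species-tree bipartition appears in at least one gene tree."""
    needed = bipartitions(species_tree)
    seen = set()
    for gene_tree in gene_trees:
        seen |= bipartitions(gene_tree)
    return needed <= seen
```

In practice the species tree is unknown, so a check of this kind applies to simulations or to a candidate tree; trees stored in Newick format would first need to be parsed into this nested form.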

Load-bearing premise

The derivations assume no recombination inside loci, no migration, constant population sizes, and that the short-branch asymptotic regime accurately describes the relevant parameter range.

What would settle it

A simulation study on a known species tree in which collections of gene trees of the size given by the new bound fail to form a complete bipartition cover more often than the prescribed failure probability allows would show the bound is not valid.

original abstract

Bipartition cover probabilities quantify whether a collection of gene trees contains every bipartition of the underlying species tree, a condition that underlies finite-sample guarantees for summary methods such as ASTRAL. We study this problem under the multispecies coalescent (MSC) model and derive topology-free upper bounds on the number of loci required to obtain a bipartition cover with prescribed confidence, improving upon the existing bounds of Uricchio et al. (2016). Practically, our bounds remain below biologically realistic numbers of loci across a substantially broader range of parameter settings, expanding their usefulness for empirical datasets. Theoretically, our analysis sharpens our understanding of coalescence under the MSC model and develops new asymptotics for these bounds and absorption times under Kingman's coalescent in the natural short branch regime. We further compare our new bounds with existing work using simulations under a variety of different species-tree topologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript derives improved topology-free upper bounds on the number of loci needed to achieve a bipartition cover with prescribed probability under the multispecies coalescent. The bounds rely on new asymptotic expansions for absorption times in the short-branch regime of Kingman's coalescent and are claimed to hold uniformly over species-tree topologies, improving on Uricchio et al. (2016). Explicit derivations are provided along with simulations across multiple topologies that numerically support the bounds remaining below realistic locus counts in wider parameter regimes.

Significance. If the new asymptotics and uniform topology-free property hold, the work supplies tighter, more usable finite-sample guarantees for summary phylogenomic methods such as ASTRAL and advances the theoretical analysis of coalescence times under the MSC. The parameter-free character of the derivations and the explicit comparison via simulation are strengths that would make the bounds directly applicable to empirical study design.

major comments (2)
  1. [§3.2] Worst-case analysis: The topology-free claim is obtained by taking a supremum over all species-tree topologies, yet the manuscript neither proves that the supremum is attained nor quantifies the gap between the bound and the maximum realized over topologies. Simulations sample only a limited collection of topologies, so the uniform improvement over Uricchio et al. is not fully substantiated.
  2. [§4.1, Eq. (15)] The refined short-branch expansion for absorption times is central to the locus-count bound, but the order of the remainder term is not explicitly controlled for the moderate branch-length values used in the simulations; this leaves open whether the upper bound remains valid outside the extreme short-branch limit.
minor comments (3)
  1. The abstract asserts that the bounds 'remain below biologically realistic numbers of loci' across broader regimes, but the main text should cite specific empirical locus counts or references that define 'biologically realistic'.
  2. [Simulation section] Simulation results would be strengthened by reporting the number of Monte Carlo replicates and any variability (standard errors or ranges) rather than point estimates alone.
  3. Notation for the bipartition-cover probability p_n and the absorption-time random variable could be introduced in the introduction or model section to improve readability for readers outside coalescent theory.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the detailed comments on our manuscript. We address the major comments point by point below, and indicate where revisions will be made.

point-by-point responses
  1. Referee: [§3.2] Worst-case analysis: The topology-free claim is obtained by taking a supremum over all species-tree topologies, yet the manuscript neither proves that the supremum is attained nor quantifies the gap between the bound and the maximum realized over topologies. Simulations sample only a limited collection of topologies, so the uniform improvement over Uricchio et al. is not fully substantiated.

    Authors: The topology-free bound is derived by taking the supremum over all possible topologies in the expression for the absorption probabilities, which ensures that the resulting upper bound on the number of loci holds uniformly for any species tree. While we have not provided a formal proof that this supremum is attained at a specific topology or quantified the exact gap to the realized maximum, the bound is conservative by construction. Our simulations across multiple topologies, including balanced and unbalanced trees with varying numbers of taxa, numerically confirm that the bound is not exceeded and improves upon Uricchio et al. (2016). In revision, we will expand the simulation section to include additional topologies and add a remark clarifying the nature of the supremum-based bound. revision: partial

  2. Referee: [§4.1, Eq. (15)] The refined short-branch expansion for absorption times is central to the locus-count bound, but the order of the remainder term is not explicitly controlled for the moderate branch-length values used in the simulations; this leaves open whether the upper bound remains valid outside the extreme short-branch limit.

    Authors: We agree that the asymptotic expansion in Eq. (15) is derived under the short-branch regime, and while the leading terms provide the refined bound, the remainder term's order is not explicitly bounded for moderate branch lengths. However, the overall locus-count bound is still valid as an upper bound because it is constructed conservatively using the expansion. In the simulations, we used branch lengths that are short but not extreme, and the numerical results support the bound holding. To address this, we will revise the manuscript to include an explicit error bound or a discussion of the validity range for the approximation, perhaps by deriving a uniform error estimate. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's central claims rest on explicit derivations of new short-branch asymptotics for absorption times under Kingman's coalescent, applied uniformly over species-tree topologies under standard MSC assumptions. These asymptotics are developed from first principles rather than fitted parameters or self-referential definitions, and the topology-free upper bounds on loci for bipartition cover are obtained via worst-case analysis that does not reduce to the inputs by construction. Simulations across topologies provide independent numerical confirmation, and the cited prior work (Uricchio et al. 2016) is external rather than a self-citation chain. No load-bearing step equates a prediction to a fitted input or renames a known result as a novel derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The bounds rest on the standard multispecies coalescent framework and Kingman's coalescent; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Standard assumptions of the multispecies coalescent (no intra-locus recombination, no migration, constant effective population sizes)
    Invoked throughout to model gene-tree discordance.
  • domain assumption: Properties of absorption times under Kingman's coalescent in the short-branch regime
    Used to obtain the new asymptotic bounds.

pith-pipeline@v0.9.0 · 5445 in / 1234 out tokens · 50503 ms · 2026-05-10T16:31:58.634365+00:00 · methodology


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola

    doi: 10.1214/aos/1176346604. URL https://doi.org/10.1214/aos/1176346604. Meitner Cadena. A note on Tauberian theorems of exponential type,

  2. [2]

    org/abs/1503.08793

    URL https://arxiv.org/abs/1503.08793. Hua Chen and Kun Chen. Asymptotic distributions of coalescence times and ancestral lineage numbers for populations with temporally varying size. Genetics, 194(3):721–736,

  3. [3]

    doi: 10.1534/genetics.113.151522. Gary D. Eppen. Convexity-preserving transformations with stochastic matrices. SIAM Journal on Applied Mathematics, 14(2):234–236,

  4. [4]

    Accessed: 2025-11-04

    URL http://www.jstor.org/stable/2946261. Accessed: 2025-11-04. Samuel Karlin and James McGregor. Coincidence probabilities. Pacific Journal of Mathematics, 9(4):1141–1164,

  5. [5]

    Sophie J Kersting, Kristina Wicke, and Mareike Fischer

    URL https://arxiv.org/abs/1712.07553. Sophie J Kersting, Kristina Wicke, and Mareike Fischer. Tree balance in phylogenetic models. Philosophical Transactions of the Royal Society B: Biological Sciences, 380(1919):20230303,

  6. [6]

    URL https://doi.org/10.1098/rstb.2023.0303

    doi: 10.1098/rstb.2023.0303. URL https://doi.org/10.1098/rstb.2023.0303. Nicholas J. Loman and Mark J. Pallen. Sequencing of bacterial genomes: principles and insights. Microbiology and Molecular Biology Reviews, 79(3):363–385,

  7. [7]

    00009-15

    doi: 10.1128/MMBR.00009-15. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC3927574/. Zachary McNulty. msc bipartition cover: Code repository. https://github.com/zackmcnulty/msc_bipartition_cover,

  8. [8]

    Accessed: 2026-04-03

    GitHub repository. Accessed: 2026-04-03. Siavash Mirarab and Tandy Warnow. Astral-ii: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics, 31(12):i44–i52,

  9. [9]

    URL https://doi.org/10.1093/bioinformatics/btv234

    doi: 10.1093/bioinformatics/btv234. URL https://doi.org/10.1093/bioinformatics/btv234. Siavash Mirarab, Rezwana Reaz, Md Shamsuzzoha Bayzid, Théo Zimmermann, M Shel Swenson, and Tandy Warnow. Astral: genome-scale coalescent-based species tree estimation. Bioinformatics, 30(17):i541–i548,

  10. [10]

    URL https://doi.org/10.1093/bioinformatics/btu462

    doi: 10.1093/bioinformatics/btu462. URL https://doi.org/10.1093/bioinformatics/btu462. Martin Möhle and Helmut Pitters. Absorption time and tree length of the Kingman coalescent and the Gumbel distribution. Markov Processes and Related Fields, 21(2):317–338,

  11. [11]

    URL https://www.nature.com/articles/s41586-023-06490-x

    doi: 10.1038/s41586-023-06490-x. URL https://www.nature.com/articles/s41586-023-06490-x. Bruce Rannala and Ziheng Yang. The multispecies coalescent model and species tree inference. In Cédric Scornavacca, Frédéric Delsuc, and Nicolas Galtier, editors, Phylogenetics in the Genomic Era, chapter 3.3. No commercial publisher, Authors open access book,

  12. [12]

    Moshe Shaked and J

    URL https://arxiv.org/abs/1404.5886. Moshe Shaked and J. George Shanthikumar. Stochastic Orders. Springer Series in Statistics. Springer, 1st edition,

  13. [13]

    ISBN 978-0-387-32915-4. doi: 10.1007/978-0-387-34675-5. Shameek Shekhar, Sebastien Roch, and Siavash Mirarab. Species tree estimation using ASTRAL: How many genes are enough? IEEE/ACM Transactions on Computational Biology and Bioinformatics, 15(5):1738–1747, Sep

  14. [14]

    Epub 2017 Sep

    doi: 10.1109/TCBB.2017.2757930. Epub 2017 Sep

  15. [15]

    George P

    doi: 10.1186/s12859-016-1266-4. George P. Yanev. Exponential and hypoexponential distributions: Some characterizations. Mathematics, 8(12):2207, December

  16. [16]

    doi: 10.3390/math8122207

    ISSN 2227-7390. doi: 10.3390/math8122207. URL http://dx.doi.org/10.3390/math8122207. Ziheng Yang and Bruce Rannala. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics, 164(4):1645–1656,

  17. [17]

    Chao Zhang and Siavash Mirarab

    URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1462670/pdf/12930768.pdf. Chao Zhang and Siavash Mirarab. Weighting by gene tree uncertainty improves accuracy of quartet-based species trees. Molecular Biology and Evolution, 39(12):msac215,

  18. [18]

    doi: 10.1093/molbev/msac215

    ISSN 1537-1719. doi: 10.1093/molbev/msac215. URL https://doi.org/10.1093/molbev/msac215. Chao Zhang, Maryam Rabiee, Erfan Sayyari, and Siavash Mirarab. Astral-iii: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics, 19(Suppl 6):153,

  19. [19]

    doi: 10.1186/s12859-018-2129-y.