Recognition: unknown
Optimal Phylogenetic Reconstruction from Sampled Quartets
Pith reviewed 2026-05-10 05:50 UTC · model grok-4.3
The pith
Phylogenetic trees can be reconstructed from a linear number of noisy quartets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given m = Theta(n) quartets drawn from an unknown ground-truth tree T and then corrupted independently under the random classification noise model, there is an efficient algorithm that outputs a tree hat T close to T in quartet distance. The algorithm proceeds by repeatedly applying a Quartet-based Embedding and Detection procedure that uses semidefinite programming to identify and remove well-clustered subtrees of T. The same analysis yields a matching Theta(n) bound on the Natarajan dimension of the class of all phylogenetic trees.
What carries the argument
Quartet-based Embedding and Detection procedure, which embeds the quartets and uses semidefinite programming to identify and prune well-clustered subtrees of the unknown ground-truth tree.
If this is right
- Reconstruction succeeds with only Theta(n) samples, matching the Omega(n) information-theoretic lower bound.
- The output tree is close to the ground truth in quartet distance and therefore predicts the labels of unseen quartets.
- A (1-eps)-approximation to the maximum-agreement tree can be obtained in polynomial time.
- The class of phylogenetic trees has Natarajan dimension Theta(n), implying optimal sample complexity for learning in the multiclass PAC setting.
Where Pith is reading between the lines
- The SDP-based clustering step may extend to tree reconstruction under other common biological noise models beyond random classification noise.
- If the random classification noise model approximates real genomic data, the result indicates that far fewer sequence alignments are needed for accurate evolutionary tree inference than previously thought.
- Similar embedding-and-detection techniques could be applied to other combinatorial problems that involve recovering hidden hierarchical structure from noisy local observations.
Load-bearing premise
The observed quartets are generated by first sampling from a single fixed unknown tree and then flipping each quartet label independently with constant probability under the random classification noise model.
What would settle it
Run the algorithm on synthetic instances with m = c n samples for small constant c and measure whether the quartet distance between the output tree and the true tree remains bounded away from zero with high probability.
Figures
read the original abstract
Quartet Reconstruction, the task of recovering a phylogenetic tree from smaller trees on four species called \textit{quartets}, is a well-studied problem in theoretical computer science with far-reaching connections to statistics, graph theory and biology. Given a random sample containing $m$ noisy quartets, labeled by an unknown ground-truth tree $T$ on $n$ taxa, we want to output a tree $\widehat T$ that is \textit{close} to $T$ in terms of quartet distance and can predict unseen quartets. Unfortunately, the empirical risk minimizer corresponds to the $\mathsf{NP}$-hard problem of finding a tree that maximizes agreements with the sampled quartets, and earlier works in approximation algorithms gave $(1-\eps)$-approximation schemes (PTAS) for dense instances with $m=\Theta(n^4)$ quartets, or for $m=\Theta(n^2\log n)$ quartets \textit{randomly} sampled from $T$. Prior to our work, it was unknown how many samples are information-theoretically required to learn the tree, and whether there is an efficient reconstruction algorithm. We present optimal results for reconstructing an unknown phylogenetic tree $T$ from a random sample of $m=\Theta(n)$ quartets, corrupted under the Random Classification Noise (RCN) model. This matches the $\Omega(n)$ lower bound required for any meaningful tree reconstruction. Our contribution is twofold: first, we give a tree reconstruction algorithm that, not only achieves a $(1-\eps)$-approximation, but most importantly \textit{recovers} a tree close to $T$ in quartet distance; second, we show a new $\Theta(n)$ bound on the Natarajan dimension of phylogenies (an analog of VC dimension in multiclass classification). Our analysis relies on a new \textit{Quartet-based Embedding and Detection} procedure that identifies and removes well-clustered subtrees from the (unknown) ground-truth $T$ via semidefinite programming.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to give the first efficient algorithm for reconstructing an unknown phylogenetic tree T on n taxa from a random sample of m=Θ(n) quartets corrupted by Random Classification Noise. The algorithm achieves both (1-ε)-approximation in quartet agreement and recovery of a tree close to T in quartet distance, matching the Ω(n) information-theoretic lower bound. The key technical contribution is a recursive Quartet-based Embedding and Detection procedure that uses semidefinite programming to identify and remove well-clustered subtrees; the paper also proves a new Θ(n) bound on the Natarajan dimension of the class of phylogenetic trees.
Significance. If the analysis holds, the result is significant: it closes the sample-complexity gap for phylogenetic reconstruction from quartets, improving on prior PTAS results that required Θ(n² log n) or denser samples, and supplies a new combinatorial dimension bound that may be of independent interest in learning theory. The SDP-based detection procedure is a novel algorithmic tool for exploiting tree structure under noise.
major comments (2)
- [Abstract and Quartet-based Embedding and Detection procedure] The Quartet-based Embedding and Detection procedure (Abstract and the section describing the recursive SDP step) must address detection of subtrees on k = o(n) leaves. Under uniform sampling of m = Θ(n) quartets, the expected number of quartets internal to such a subtree is Θ(n)·(k/n)⁴ = o(1), supplying essentially no direct SDP constraints. The recovery guarantee for arbitrary binary trees therefore requires an explicit argument that detection succeeds via global embedding or propagation even when local evidence is absent; without it, the claim that the algorithm recovers a tree close to T on general instances is not yet supported.
- [Recovery guarantee and analysis of SDP detection] The manuscript asserts that the SDP step succeeds with high probability under the RCN noise model and yields the claimed quartet-distance recovery. No proof sketch, concentration bound, or reference to the relevant theorem establishing this for every recursion depth appears in the provided description. Because the overall correctness rests on this step, a high-level outline of the error analysis (or a pointer to the full argument) is needed.
minor comments (2)
- [Abstract] The abstract refers to 'well-clustered subtrees' without an immediate definition; the notion should be formalized in the first section that introduces the procedure.
- [Introduction / Preliminaries] Notation for the quartet distance and the precise statement of the Natarajan-dimension bound would benefit from an early, self-contained definition before the main theorems.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major comment below, clarifying the analysis and indicating revisions to improve the exposition of the Quartet-based Embedding and Detection procedure and its guarantees.
read point-by-point responses
-
Referee: [Abstract and Quartet-based Embedding and Detection procedure] The Quartet-based Embedding and Detection procedure (Abstract and the section describing the recursive SDP step) must address detection of subtrees on k = o(n) leaves. Under uniform sampling of m = Θ(n) quartets, the expected number of quartets internal to such a subtree is Θ(n)·(k/n)⁴ = o(1), supplying essentially no direct SDP constraints. The recovery guarantee for arbitrary binary trees therefore requires an explicit argument that detection succeeds via global embedding or propagation even when local evidence is absent; without it, the claim that the algorithm recovers a tree close to T on general instances is not yet supported.
Authors: We agree that an explicit argument for the k = o(n) case is needed for full rigor. The SDP embedding is global and incorporates all Θ(n) sampled quartets, including those with leaves both inside and outside the subtree; these crossing quartets induce distance constraints that position small subtrees correctly in the embedding space relative to the rest of T. The detection step then identifies well-clustered regions via the SDP solution, and recursion propagates the structure downward. We will add a new lemma and subsection in the analysis of the procedure proving that the embedding error for any subtree of size k remains O(ε) with high probability under RCN, even when local quartets are absent, via a global-to-local propagation argument. revision: yes
-
Referee: [Recovery guarantee and analysis of SDP detection] The manuscript asserts that the SDP step succeeds with high probability under the RCN noise model and yields the claimed quartet-distance recovery. No proof sketch, concentration bound, or reference to the relevant theorem establishing this for every recursion depth appears in the provided description. Because the overall correctness rests on this step, a high-level outline of the error analysis (or a pointer to the full argument) is needed.
Authors: The full proof appears in Section 5 (Theorems 5.1–5.3), using matrix concentration for the SDP under RCN and induction on recursion depth to bound quartet-distance error. To make this accessible, we will insert a high-level proof sketch immediately after the procedure description, summarizing: (i) the SDP relaxation approximates the true tree metric with additive error controlled by the noise rate via a matrix Bernstein bound; (ii) the rounding and clustering step recovers subtrees with quartet distance O(ε) w.h.p.; (iii) the inductive hypothesis carries the (1-ε) approximation and recovery guarantees through all recursion levels. We will also add explicit pointers to the relevant theorems. revision: yes
Circularity Check
No significant circularity; new SDP procedure and Natarajan dimension bound are independently derived
full rationale
The paper introduces a Quartet-based Embedding and Detection procedure using SDP on sampled quartets and separately proves a new Θ(n) Natarajan dimension bound for phylogenies. Neither reduces to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The claimed (1-ε)-approximation and recovery with m=Θ(n) samples follows from the analysis of this procedure rather than assuming the output tree. The matching Ω(n) lower bound is a general information-theoretic requirement, not derived from the algorithm itself. This is the expected outcome for a self-contained algorithmic paper with external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Semidefinite programming can be solved to sufficient accuracy in polynomial time for the embedding and detection procedure.
- domain assumption Random Classification Noise flips each quartet label independently with fixed probability.
Reference graph
Works this paper leans on
-
[1]
2001 , publisher=
Random Graphs , author=. 2001 , publisher=
2001
-
[2]
Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=
A cost function for similarity-based hierarchical clustering , author=. Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=
-
[3]
Advances in Neural Information Processing Systems , volume=
Subsampled power iteration: a unified algorithm for block models and planted csp's , author=. Advances in Neural Information Processing Systems , volume=
-
[4]
Optimal Sample Complexity of Contrastive Learning , year =
Alon, Noga and Avdiukhin, Dmitrii and Elboim, Dor and Fischer, Orr and Yaroslavtsev, Grigory , booktitle =. Optimal Sample Complexity of Contrastive Learning , year =
-
[5]
Transactions of the American Mathematical Society , volume=
Lower bounds for approximation by nonlinear manifolds , author=. Transactions of the American Mathematical Society , volume=. 1968 , publisher=
1968
-
[6]
Grothendieck, Alexandre , volume=. R. 1956 , publisher=
1956
-
[7]
Constantes de
Krivine, Jean-Louis , journal=. Constantes de
-
[8]
SIAM Journal on Computing , volume=
Beating the random ordering is hard: Every ordering CSP is approximation resistant , author=. SIAM Journal on Computing , volume=. 2011 , publisher=
2011
-
[9]
Zootaxa , volume=
The tree of life web project , author=. Zootaxa , volume=. 2007 , publisher=
2007
-
[10]
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing , pages=
Strongly refuting random csps below the spectral threshold , author=. Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing , pages=
-
[11]
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=
The threshold for SDP-refutation of random regular NAE-3SAT , author=. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages=. 2019 , organization=
2019
-
[12]
Annals of Applied Probability , pages=
Broadcasting on trees and the Ising model , author=. Annals of Applied Probability , pages=. 2000 , publisher=
2000
-
[13]
Annals of Applied Probability , pages=
Reconstruction on trees: beating the second eigenvalue , author=. Annals of Applied Probability , pages=. 2001 , publisher=
2001
-
[14]
Journal of Bioinformatics and Computational Biology , volume=
Inferring phylogenetic relationships avoiding forbidden rooted triplets , author=. Journal of Bioinformatics and Computational Biology , volume=. 2006 , publisher=
2006
-
[15]
Probability Theory and Related Fields , volume=
Evolutionary trees and the Ising model on the Bethe lattice: a proof of Steel’s conjecture , author=. Probability Theory and Related Fields , volume=. 2011 , publisher=
2011
-
[16]
The Thirty Sixth Annual Conference on Learning Theory , pages=
Learning and Testing Latent-Tree Ising Models Efficiently , author=. The Thirty Sixth Annual Conference on Learning Theory , pages=. 2023 , organization=
2023
-
[17]
Proceedings of the forty-sixth annual ACM symposium on Theory of computing , pages=
Satisfiability threshold for random regular NAE-SAT , author=. Proceedings of the forty-sixth annual ACM symposium on Theory of computing , pages=
-
[18]
Proceedings of the forty-seventh annual ACM symposium on Theory of computing , pages=
Proof of the satisfiability conjecture for large k , author=. Proceedings of the forty-seventh annual ACM symposium on Theory of computing , pages=
-
[19]
2015 IEEE 56th Annual Symposium on Foundations of Computer Science , pages=
How to refute a random CSP , author=. 2015 IEEE 56th Annual Symposium on Foundations of Computer Science , pages=. 2015 , organization=
2015
-
[20]
Journal of Combinatorial Optimization , volume=
Constructing the maximum consensus tree from rooted triples , author=. Journal of Combinatorial Optimization , volume=. 2004 , publisher=
2004
-
[21]
PLoS One , volume=
Oldest known pantherine skull and evolution of the tiger , author=. PLoS One , volume=. 2011 , publisher=
2011
-
[22]
1979 , publisher=
On the complexity of the satisfiability problem , author=. 1979 , publisher=
1979
-
[23]
Discrete Applied Mathematics , volume=
Probabilistic analysis of the Davis Putnam procedure for solving the satisfiability problem , author=. Discrete Applied Mathematics , volume=. 1983 , publisher=
1983
-
[24]
Journal of the ACM (JACM) , volume=
Many hard examples for resolution , author=. Journal of the ACM (JACM) , volume=. 1988 , publisher=
1988
-
[25]
Aaai , volume=
Hard and easy distributions of SAT problems , author=. Aaai , volume=
-
[26]
Advances in applied mathematics , volume=
Extension operations on sets of leaf-labeled trees , author=. Advances in applied mathematics , volume=. 1995 , publisher=
1995
-
[27]
Mathematical biosciences , volume=
Closure operations in phylogenetics , author=. Mathematical biosciences , volume=. 2007 , publisher=
2007
-
[28]
Transactions of the American Mathematical Society , volume=
Phase transitions in phylogeny , author=. Transactions of the American Mathematical Society , volume=
-
[29]
Proceedings of the thirty-eighth annual ACM symposium on Theory of computing , pages=
Optimal phylogenetic reconstruction , author=. Proceedings of the thirty-eighth annual ACM symposium on Theory of computing , pages=
-
[30]
British Journal of Mathematical and Statistical Psychology , volume=
Tree structures for proximity data , author=. British Journal of Mathematical and Statistical Psychology , volume=. 1981 , publisher=
1981
-
[31]
Science , volume=
Critical behavior in the satisfiability of random boolean expressions , author=. Science , volume=. 1994 , publisher=
1994
-
[32]
Artificial intelligence , volume=
Generating hard satisfiability problems , author=. Artificial intelligence , volume=. 1996 , publisher=
1996
-
[33]
Proceedings., 33rd Annual Symposium on Foundations of Computer Science , pages=
Mick gets some (the odds are on his side)(satisfiability) , author=. Proceedings., 33rd Annual Symposium on Foundations of Computer Science , pages=. 1992 , organization=
1992
-
[34]
Journal of the ACM (JACM) , volume=
A computing procedure for quantification theory , author=. Journal of the ACM (JACM) , volume=. 1960 , publisher=
1960
-
[35]
Communications of the ACM , volume=
A machine program for theorem-proving , author=. Communications of the ACM , volume=. 1962 , publisher=
1962
-
[36]
Molecular Phylogenetics and Evolution , volume=
Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae) , author=. Molecular Phylogenetics and Evolution , volume=. 2010 , publisher=
2010
-
[37]
SIAM Journal on Computing , volume=
Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions , author=. SIAM Journal on Computing , volume=. 1981 , publisher=
1981
-
[38]
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing , pages=
On the power of unique 2-prover 1-round games , author=. Proceedings of the thiry-fourth annual ACM symposium on Theory of computing , pages=
-
[39]
Discrete Applied Mathematics , volume=
The approximability of maximum rooted triplets consistency with fan triplets and forbidden triplets , author=. Discrete Applied Mathematics , volume=. 2019 , publisher=
2019
-
[40]
Discrete Applied Mathematics , volume=
New results on optimizing rooted triplets consistency , author=. Discrete Applied Mathematics , volume=. 2010 , publisher=
2010
-
[41]
Proceedings of the twenty-seventh annual ACM symposium on Theory of computing , pages=
Polynomial time approximation schemes for dense instances of NP-hard problems , author=. Proceedings of the twenty-seventh annual ACM symposium on Theory of computing , pages=
-
[42]
Mathematical programming , volume=
A new rounding procedure for the assignment problem with applications to dense graph arrangement problems , author=. Mathematical programming , volume=. 2002 , publisher=
2002
-
[43]
SIAM Journal on Computing , volume=
A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application , author=. SIAM Journal on Computing , volume=. 2001 , publisher=
2001
-
[44]
SIAM Journal on Discrete Mathematics , volume=
On the compatibility of quartet trees , author=. SIAM Journal on Discrete Mathematics , volume=. 2014 , publisher=
2014
-
[45]
SIAM Journal on Discrete Mathematics , volume=
On the maximum quartet distance between phylogenetic trees , author=. SIAM Journal on Discrete Mathematics , volume=. 2016 , publisher=
2016
-
[46]
Journal of computational Biology , volume=
Constructing phylogenies from quartets: elucidation of eutherian superordinal relationships , author=. Journal of computational Biology , volume=
-
[47]
IEEE/ACM Transactions on Computational Biology and Bioinformatics , volume=
Quartets MaxCut: a divide and conquer quartets algorithm , author=. IEEE/ACM Transactions on Computational Biology and Bioinformatics , volume=. 2008 , publisher=
2008
-
[48]
Proceedings of the forty-sixth annual ACM symposium on Theory of computing , pages=
A characterization of strong approximation resistance , author=. Proceedings of the forty-sixth annual ACM symposium on Theory of computing , pages=
-
[49]
Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat
Orchestrating quartets: approximation and data correction , author=. Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No. 98CB36280) , pages=. 1998 , organization=
1998
-
[50]
SIAM Journal on Computing , volume=
Reconstructing approximate phylogenetic trees from quartet samples , author=. SIAM Journal on Computing , volume=. 2012 , publisher=
2012
-
[51]
Journal of Computational Biology , volume=
Short quartet puzzling: A new quartet-based phylogeny reconstruction algorithm , author=. Journal of Computational Biology , volume=. 2008 , publisher=
2008
-
[52]
Molecular phylogenetics and evolution , volume=
Quartet MaxCut: a fast algorithm for amalgamating quartet trees , author=. Molecular phylogenetics and evolution , volume=. 2012 , publisher=
2012
-
[53]
Systematic biology , volume=
Weighted quartets phylogenetics , author=. Systematic biology , volume=. 2015 , publisher=
2015
-
[54]
2004 , publisher=
Inferring phylogenies , author=. 2004 , publisher=
2004
-
[55]
Systematic Biology , volume=
Consensus techniques and the comparison of taxonomic trees , author=. Systematic Biology , volume=. 1972 , publisher=
1972
-
[56]
In KDD workshop on text mining, volume 400
A comparison of document clustering techniques , author=. In KDD workshop on text mining, volume 400. Boston , year=
-
[57]
Journal of the ACM (JACM) , volume=
Some optimal inapproximability results , author=. Journal of the ACM (JACM) , volume=. 2001 , publisher=
2001
-
[58]
Bioconsensus: DIMACS Working Group Meetings on Bioconsensus: October 25-26, 2000 and October 2-5, 2001, DIMACS Center , volume=
A Classification of Consensus Methods for Phylogenetics , author=. Bioconsensus: DIMACS Working Group Meetings on Bioconsensus: October 25-26, 2000 and October 2-5, 2001, DIMACS Center , volume=. 2003 , organization=
2000
-
[59]
Trends in Ecology & Evolution , volume=
Phylogenetic supertrees: assembling the trees of life , author=. Trends in Ecology & Evolution , volume=. 1998 , publisher=
1998
-
[60]
Systematic Biology , volume=
Simple but fundamental limitations on supertree and consensus tree methods , author=. Systematic Biology , volume=. 2000 , publisher=
2000
-
[61]
Algorithmica , volume=
Constructing a tree from homeomorphic subtrees, with applications to computational evolutionary biology , author=. Algorithmica , volume=. 1999 , publisher=
1999
-
[62]
Information Processing Letters , volume=
On the agreement of many trees , author=. Information Processing Letters , volume=. 1995 , publisher=
1995
-
[63]
SIAM Journal on Computing , volume=
Computing the local consensus of trees , author=. SIAM Journal on Computing , volume=. 1998 , publisher=
1998
-
[64]
Discrete applied mathematics , volume=
Reconstruction of rooted trees from subtrees , author=. Discrete applied mathematics , volume=. 1996 , publisher=
1996
-
[65]
Discrete Applied Mathematics , volume=
A supertree method for rooted trees , author=. Discrete Applied Mathematics , volume=. 2000 , publisher=
2000
-
[66]
Journal of the ACM (JACM) , volume=
Improved algorithms for constructing consensus trees , author=. Journal of the ACM (JACM) , volume=. 2016 , publisher=
2016
-
[67]
Journal of the American statistical association , volume=
Hierarchical grouping to optimize an objective function , author=. Journal of the American statistical association , volume=. 1963 , publisher=
1963
-
[68]
Proceedings of the National Academy of Sciences , volume=
Cluster analysis and display of genome-wide expression patterns , author=. Proceedings of the National Academy of Sciences , volume=. 1998 , publisher=
1998
-
[69]
Grouping multidimensional data: Recent advances in clustering , pages=
A survey of clustering data mining techniques , author=. Grouping multidimensional data: Recent advances in clustering , pages=. 2006 , publisher=
2006
-
[70]
2020 , publisher=
Mining of massive data sets , author=. 2020 , publisher=
2020
-
[71]
Nickel, Maximillian and Kiela, Douwe , journal=. Poincar
-
[72]
Nature , volume=
Hierarchical structure and the prediction of missing links in networks , author=. Nature , volume=. 2008 , publisher=
2008
-
[73]
Physical review E , volume=
Hierarchical organization in complex networks , author=. Physical review E , volume=. 2003 , publisher=
2003
-
[74]
2009 , publisher=
The elements of statistical learning: data mining, inference, and prediction , author=. 2009 , publisher=
2009
-
[75]
Journal of the Royal Statistical Society: Series A (General) , volume=
A review of hierarchical classification , author=. Journal of the Royal Statistical Society: Series A (General) , volume=. 1987 , publisher=
1987
-
[76]
European Journal of Combinatorics , volume=
Low dimensional embeddings of ultrametrics , author=. European Journal of Combinatorics , volume=. 2004 , publisher=
2004
-
[77]
Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms , pages=
Efficient algorithms for computing the triplet and quartet distance between trees of arbitrary degree , author=. Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms , pages=. 2013 , organization=
2013
-
[78]
Approximation, Randomization and Combinatorial Optimization
Beating a random assignment , author=. Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques , pages=. 2005 , publisher=
2005
-
[79]
Approximation, Randomization, and Combinatorial Optimization
On the approximation resistance of a random predicate , author=. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques , pages=. 2007 , publisher=
2007
-
[80]
Proceedings of the forty-first annual ACM symposium on Theory of computing , pages=
Randomly supported independence and resistance , author=. Proceedings of the forty-first annual ACM symposium on Theory of computing , pages=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.