arxiv: 1511.01956 · v2 · pith:3G2PWLMPnew · submitted 2015-11-05 · 🧬 q-bio.PE

Statistically-Consistent k-mer Methods for Phylogenetic Tree Reconstruction

classification 🧬 q-bio.PE
keywords: tree, distance, methods, sequences, alignment, first, frequencies, multiple

Frequencies of $k$-mers in sequences are sometimes used as a basis for inferring phylogenetic trees without first obtaining a multiple sequence alignment. We show that a standard approach of using the squared-Euclidean distance between $k$-mer vectors to approximate a tree metric can be statistically inconsistent. To remedy this, we derive model-based distance corrections for orthologous sequences without gaps, which lead to consistent tree inference. The identifiability of model parameters from $k$-mer frequencies is also studied. Finally, we report simulations showing that the corrected distance outperforms many other $k$-mer methods, even when sequences are generated with an insertion and deletion process. These results have implications for multiple sequence alignment as well, since $k$-mer methods are usually the first step in constructing a guide tree for such algorithms.
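
As a rough illustration of the uncorrected approach the abstract describes, the sketch below builds $k$-mer count vectors for two ungapped DNA sequences and takes the squared-Euclidean distance between them. The function names `kmer_counts` and `squared_euclidean`, the `ACGT` alphabet, and the use of raw counts rather than normalized frequencies are assumptions made for this example. This is not the paper's model-based corrected distance, which is not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k, alphabet="ACGT"):
    """Return the k-mer count vector of seq (hypothetical helper).

    Entries are ordered lexicographically over `alphabet`, so vectors
    from different sequences are directly comparable.
    """
    # Slide a window of width k over the sequence and tally each k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Counter returns 0 for k-mers that never occur.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def squared_euclidean(u, v):
    """Squared-Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Naive (uncorrected) k-mer distance between two short sequences.
x = kmer_counts("ACGT", 2)
y = kmer_counts("ACGA", 2)
d = squared_euclidean(x, y)
```

A matrix of such pairwise values could be passed to a distance-based tree method. According to the abstract, this uncorrected distance can give statistically inconsistent tree inference, which is the problem the paper's correction is meant to address.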
