arxiv: 1509.04412 · v2 · pith:34A5RU6Cnew · submitted 2015-09-15 · 🧬 q-bio.QM · q-bio.GN

Orthologs from maxmer sequence context

classification 🧬 q-bio.QM q-bio.GN
keywords orthologs, matches, maximal, non-embedded, alignment, ortholog, whole-genome, best

Context-dependent identification of orthologs customarily relies on conserved gene order or whole-genome sequence alignment. It is shown here that short-range context (as short as single maximal matches) also provides an effective means to identify orthologs within whole genomes. On pristine (un-repeatmasked) mammalian whole-genome assemblies we perform a genome "intersection" that in general consumes less than one thirtieth of the computation time required by commonly used methods for whole-genome alignment, and we extract "non-embedded maximal matches," maximal matches that are not embedded into other maximal matches, as potential orthologs. An ortholog identified via non-embedded maximal matches is analogous to a "positional ortholog" or a "primary ortholog" as defined in previous literature; such orthologs constitute homologs derived from the same direct ancestor whose ancestral positions in the genome are conserved. At the nucleotide level, non-embedded maximal matches recapitulate most exact matches identified by a Lastz net alignment. At the gene level, reciprocal best hits of genes containing non-embedded maximal matches recover one-to-one orthologs annotated by Ensembl Compara with high selectivity and high sensitivity; these reciprocal best hits additionally include putatively novel orthologs not found in Ensembl (e.g., over two thousand for human/chimpanzee). The method is especially suitable for genome-wide identification of orthologs.
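
The two filtering steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the exact definitions, so several things here are assumptions: a match is treated as "embedded" when its interval in either genome lies wholly inside the interval of a longer maximal match, and a gene pair is scored by a single number (for example, the total bases covered by shared non-embedded matches). The names `Match`, `non_embedded`, and `reciprocal_best_hits` are hypothetical, and the quadratic scan is for clarity only; it would not scale to whole genomes.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One exact maximal match between genome A and genome B (hypothetical record)."""
    a_start: int  # 0-based start in genome A
    b_start: int  # 0-based start in genome B
    length: int   # match length in bases


def _contains(outer_start, outer_len, inner_start, inner_len):
    """True if the inner half-open interval lies wholly inside the outer one."""
    return (outer_start <= inner_start
            and inner_start + inner_len <= outer_start + outer_len)


def non_embedded(matches):
    """Keep matches that are not embedded in a longer match.

    ASSUMPTION: "embedded" means contained in a longer match's interval
    in genome A or in genome B. Naive O(n^2) scan, for illustration only.
    """
    kept = []
    for m in matches:
        embedded = any(
            o.length > m.length
            and (_contains(o.a_start, o.length, m.a_start, m.length)
                 or _contains(o.b_start, o.length, m.b_start, m.length))
            for o in matches
        )
        if not embedded:
            kept.append(m)
    return kept


def reciprocal_best_hits(scores):
    """Return gene pairs that are each other's best-scoring partner.

    `scores` maps (gene_in_A, gene_in_B) -> score. ASSUMPTION: the score
    is any per-pair number; ties keep the first pair encountered.
    """
    best_for_a = {}  # gene in A -> (score, gene in B)
    best_for_b = {}  # gene in B -> (score, gene in A)
    for (gene_a, gene_b), score in scores.items():
        if gene_a not in best_for_a or score > best_for_a[gene_a][0]:
            best_for_a[gene_a] = (score, gene_b)
        if gene_b not in best_for_b or score > best_for_b[gene_b][0]:
            best_for_b[gene_b] = (score, gene_a)
    return sorted(
        (gene_a, gene_b)
        for gene_a, (_, gene_b) in best_for_a.items()
        if best_for_b[gene_b][1] == gene_a
    )


if __name__ == "__main__":
    # Toy data, invented for illustration.
    toy_matches = [Match(0, 100, 50), Match(10, 500, 20), Match(200, 300, 30)]
    print(non_embedded(toy_matches))

    toy_scores = {("A1", "B1"): 90, ("A1", "B2"): 40,
                  ("A2", "B2"): 70, ("A3", "B2"): 60}
    print(reciprocal_best_hits(toy_scores))
```

In the toy data, the 20-base match at position 10 of genome A falls inside the 50-base match starting at position 0, so it is dropped as embedded; gene `A3` is excluded from the reciprocal best hits because its best partner `B2` scores higher with `A2`.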
