
arxiv: 1701.06279 · v1 · pith:POVW3SF5new · submitted 2017-01-23 · 🧬 q-bio.QM · cs.CL · cs.LG · stat.ML

dna2vec: Consistent vector representations of variable-length k-mers

classification 🧬 q-bio.QM · cs.CL · cs.LG · stat.ML
keywords dna2vec · vector · vectors · k-mer · k-mers · method · one-hot · representations

One ubiquitous representation of a long DNA sequence divides it into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, every pair of distinct one-hot vectors is equidistant. This is particularly problematic when applying the latest machine learning algorithms to problems in biological sequence analysis. In this paper, we propose a novel method to train distributed representations of variable-length k-mers. Our method is based on the popular word embedding model word2vec, which is trained on a shallow two-layer neural network. Our experiments provide evidence that summing dna2vec vectors is akin to nucleotide concatenation. We also demonstrate that there is a correlation between the Needleman-Wunsch similarity score and the cosine similarity of dna2vec vectors.
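
The two problems the abstract attributes to one-hot encoding can be illustrated with a minimal sketch. It is not taken from the paper: the helper names (`kmers`, `one_hot`, `euclidean`), the stride-1 overlapping split, and the example sequence are assumptions chosen for illustration. It shows that a one-hot vector for a k-mer has 4^k entries and that any two distinct k-mers sit at the same Euclidean distance, regardless of how many nucleotides they share.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def kmers(sequence, k):
    """Return the overlapping k-mers of `sequence` (stride 1, an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def one_hot(kmer):
    """One-hot vector of length 4**k, indexed by lexicographic k-mer order."""
    vocab = ["".join(p) for p in product(ALPHABET, repeat=len(kmer))]
    vec = [0] * len(vocab)
    vec[vocab.index(kmer)] = 1
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


fragments = kmers("ACGTAC", 3)
vectors = {f: one_hot(f) for f in fragments}

# Distance for every unordered pair of distinct 3-mers in the example.
distances = {
    (a, b): euclidean(vectors[a], vectors[b])
    for a in fragments
    for b in fragments
    if a < b
}

# The vector length grows as 4**k (64 entries for k = 3), and every
# distance equals sqrt(2), so the encoding carries no similarity signal.
print(len(vectors["ACG"]), set(round(d, 6) for d in distances.values()))
```

A learned dense embedding, such as the word2vec-based one the abstract proposes, is meant to avoid both effects by using a fixed, much smaller dimension in which distances can differ between pairs.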
