pith. sign in

arxiv: 1407.1507 · v1 · pith:JHI3O36Ynew · submitted 2014-07-06 · 💻 cs.DS · cs.CE· q-bio.GN

KMC 2: Fast and resource-frugal k-mer counting

classification 💻 cs.DS cs.CEq-bio.GN
keywords countingallowsfastamountsapplicationsdatamemorymers
0
0 comments X
read the original abstract

Motivation: Building the histogram of occurrences of every $k$-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of $k$-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for $k$-mer counting, preferably using moderate amounts of memory. Results: We present a novel method for $k$-mer counting, on large datasets at least twice faster than the strongest competitors (Jellyfish~2, KMC~1), using about 12\,GB (or less) of RAM memory. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using $(k, x)$-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC~2 allows to count the 28-mers of a human reads collection with 44-fold coverage (106\,GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD. Availability: KMC~2 is freely available at http://sun.aei.polsl.pl/kmc. Contact: sebastian.deorowicz@polsl.pl

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.