pith. sign in

arxiv: 1109.2930 · v4 · pith:VKOU56JUnew · submitted 2011-09-13 · 💻 cs.DS

Faster Approximate Pattern Matching in Compressed Repetitive Texts

classification 💻 cs.DS
keywords patterntimeapproximatedatalengthmatchingshowedstring
0
0 comments X
read the original abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with $r$ rules for a string $s$ of length $n$, we can build an $\Oh{r}$-word data structure that allows us to extract any substring of length $m$ in $\Oh{\log n + m}$ time. They also showed how, given a pattern $p$ of length $m$ and an edit distance (k \leq m), their data structure supports finding all \occ approximate matches to $p$ in $s$ in $\Oh{r (\min (m k, k^4 + m) + \log n) + \occ}$ time. Rytter (2003) and Charikar et al. (2005) showed that $r$ is always at least the number $z$ of phrases in the LZ77 parse of $s$, and gave algorithms for building straight-line programs with $\Oh{z \log n}$ rules. In this paper we give a simple $\Oh{z \log n}$-word data structure that takes the same time for substring extraction but only $\Oh{z \min (m k, k^4 + m) + \occ}$ time for approximate pattern matching.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.