Speeding-up $q$-gram mining on grammar-based compressed texts

Hideo Bannai; Keisuke Goto; Masayuki Takeda; Shunsuke Inenaga

arxiv: 1202.3311 · v1 · pith:V6WS4CI6new · submitted 2012-02-15 · 💻 cs.DS




keywords algorithm · frequencies · gram · problem · compressed · grams


We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight-line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents a string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$ by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, which improves on our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
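To make the problem setting concrete, the following is a minimal Python sketch of counting $q$-grams directly on an SLP without decompressing it. It is not the trie-based reduction described in the abstract; it only illustrates the simpler idea of counting, for each rule $X_i \to X_\ell X_r$, the $q$-grams that cross the boundary between $X_\ell$ and $X_r$, weighted by how often $X_i$ occurs in the derivation tree. The function name `qgram_freq_slp` and the list-based rule encoding are assumptions made for this illustration, and the code makes no attempt to match the running times stated above.

```python
from collections import Counter

def qgram_freq_slp(rules, q):
    """Count q-gram frequencies of the string derived by an SLP.

    Hypothetical encoding (an assumption of this sketch): rules[i] is
    either a single character (terminal rule) or a pair (l, r) with
    l, r < i (binary rule X_i -> X_l X_r). The last rule is the start
    symbol.
    """
    n = len(rules)
    k = q - 1  # maximum length of stored prefixes and suffixes

    # vocc[i]: number of occurrences of X_i in the derivation tree.
    vocc = [0] * n
    vocc[n - 1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    pre = [""] * n  # prefix of the expansion of X_i, length <= q - 1
    suf = [""] * n  # suffix of the expansion of X_i, length <= q - 1
    counts = Counter()
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = rule[:k]
            suf[i] = rule[-k:] if k > 0 else ""
            # A single character is itself a q-gram only when q == 1.
            if q == 1 and vocc[i] > 0:
                counts[rule] += vocc[i]
            continue
        l, r = rule
        pre[i] = (pre[l] + pre[r])[:k]
        suf[i] = (suf[l] + suf[r])[-k:] if k > 0 else ""
        if vocc[i] == 0:
            continue  # rule is not reachable from the start symbol
        # Both halves are shorter than q, so every q-gram of t
        # crosses the boundary between X_l and X_r.
        t = suf[l] + pre[r]
        for j in range(len(t) - q + 1):
            counts[t[j:j + q]] += vocc[i]
    return counts
```

For example, with `rules = ["a", "b", (0, 1), (2, 0), (3, 2)]`, which derives the string `abaab`, calling `qgram_freq_slp(rules, 2)` yields the counts `ab: 2`, `ba: 1`, `aa: 1`.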
