pith. sign in

arxiv: 1301.3488 · v1 · pith:OIHURMRYnew · submitted 2013-01-15 · 💻 cs.DS · cs.DM· cs.IR

Various improvements to text fingerprinting

classification 💻 cs.DS cs.DMcs.IR
keywords maximalsigmafingerprintlocationsalphabetanswercharacterscompute
0
0 comments X
read the original abstract

Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet \Sigma of size \sigma. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set {\cal F} of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s_i .. s_j is a maximal location for a fingerprint f in F (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations ins is {\cal L} (it is easy to see that |{\cal L}| \leq n \sigma). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are named {\em copies}, and the quotient set of {\cal L} according to the copy relation is denoted by {\cal L}_C. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute {\cal F}; (2) given f as a set of distinct characters in \Sigma, to answer if f represents a fingerprint in {\cal F}; (3) given f, to find all maximal locations of f in s.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.