Sunday, December 9, 2012

Finding a pattern supersonic fast!

Hi all,
Today I'll be talking about a common task which is widely used on our daily basis. Googling is widespread task for searching a phrase/word/sentence in a text.
The Linux users are probably  familiar with the grep command which is used quite often,
Have you ever thought to yourself how we get the results that fast??

Searching for a string in a given text is so commonly operation, would you like to know how is it implemented efficiently?

In case you have answered to the last question with a "YES", hop in and enjoy today's journey I'll be explaining those neat algorithms! :-)


We will start with a simple warm-up naive algorithm, and along the way we will get to know interesting algorithms.

As you might know a simple naive algorithms focuses on each character from the pattern, and checks a match in the text, in case there is a match we check the next character, this goes till we have a complete match of the pattern, or along the way we reached a mismatch.

in case we have a mismatch we return back and advance the pointer
by one, and repeat the method till we have reached the end of the text.

Complexity is O(m*n)


So let's see how  Knuth-Morris-Pratt algorithm overcomes those drawbacks of the naive algorithm.
As we can see the main flaw in the naive algorithm is wasting time traversing over characters which were checked already. That's a substantially waste of time, Since we reach to the point there is no match we already should know if a prefix to our pattern already occurred till this point.
In case no prefix appeared we should start the new search for the pattern from this current position, otherwise we should start the new search from the prefix which was found.

Example #1:
===========


i
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Text
a
b
c
d
e
f
a
b
c
d
a
b
c
e
f
g


j
0
1
2
3
Text
a
b
c
e


  1. Comparison and a match between Text[0-2] to pattern[0-2] at index 3 we fail,
  2. Now We start a new comparison between i = 3 and j = 0, there is no match so     we advance the index i by one...and again by one since no match.
  3. Comparison and a match between Text[6-8] to pattern[0-2] at index 3 we fail.
  4. Now We start a new comparison between i = 9 and j = 0, there is no match so we advance the index i by one.
  5. Comparison and a match between Text[10-13] to pattern[0-3], a full match was found!!!!!!!


Example #2 (A pattern with prefix embedded into it)
==============================================


i
0
1
2
3
4
5
6
7
8
9
10
11
12
Text
a
b
a
d
e
f
a
b
c
a
a
b
c


j
0
1
2
3
4
Text
a
b
a
b
c

  1.  Comparison and a match between Text[0-2] to pattern[0-2] at index 3 we fail,
  2. Now We start a new comparison between i = 2 and j = 0 which passes but the consecutive character fails! so we advance the index i by one
  3. We reach i=6, Comparison and a match between Text[6-7] to pattern[0-1], but fail at index 2.
  4. Now We start a new comparison between i = 8 and j = 0, there is no match so we advance the index i by one.
  5. Comparison and a match between Text[9] to pattern[0], but fail immediately.

The key for the Knuth-Morris-Pratt is to hold an array which would hold positions of the longest prefix of the pattern.
This array can be generated and initialized at the beginning of the algorithms at the preprocessing state.
so for example #1:
the array should look like:


i
0
1
2
3
array
0
0
0
0


all the cells hold zero since the pattern does not consist any prefix of itself.


so for example #2:
the array should look like:


i
0
1
2
3
4
array
0
0
1
2
0


Explanation: We can see the prefix of the pattern "ab" appears at index 2 and 3,
so array value is 1 and 2.

I have written the preprocessing part on a piece of paper, and made sure It works well, so it goes like this:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
static void set_prefix_table(char pattern[] ,unsigned short prefix_array[])
{
    unsigned short i,j;
    unsigned short prefix_length = 0;
    prefix_array[0] = 0;
    j = 0;
    i = 1;
    while (i < M)
    {
        while (pattern[i] == pattern[j])
        {
            prefix_length++;
            prefix_array[i] = prefix_length;
            i++;
            j++;
        }
        prefix_length = 0;
        if (j == 0) 
        {
            prefix_array[i] = 0;
            i++;
        }
        else j = 0;        
    }    
}

calculating the array prefix_array takes O(m), when m is the length of the pattern.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
static unsigned short get_first_appearance(char Text[],char pattern[] ,unsigned short prefix_array[])
{
    unsigned short i = 0;
    unsigned short start_point_of_pattern = 0;
    unsigned short len_pattern = strlen(pattern);
    unsigned short len_text = strlen(Text);

    while ( start_point_of_pattern + len_pattern <= len_text)
    {
        if (Text[start_point_of_pattern + i] == pattern[i])
        {
            if (i == len_pattern) return start_point_of_pattern;
            i++;
        }
        else 
        {
            start_point_of_pattern = start_point_of_pattern + i - prefix_array[i]; 
            i = prefix_array[i];
        }    
    }

    return -1;
}

eventually the complexity of the algorithm is O(m+n) much more efficient!

I'll be posting in following days an animation of the algorithm which I'm working on via Actionscript, so stay tuned! :-)

No comments:

Post a Comment

About