Knuth–Morris–Pratt the KMP Algorithm

String Matching ①

2 years ago
571
2 min

First acquaintance with BF and KMP algorithms

Firstly, let’s look at these two codes:

int BF()
{
    auto pattern_size = pattern.size();
    auto text_size = text.size();
    auto pattern_index = 0;
    auto text_index = 0;

    while (pattern_index < pattern_size && text_index < text_size)
    {
        if (pattern[pattern_index] == text[text_index]) //success
        {
            ++text_index;
            ++pattern_index;
        }
        else  //failure
        {
            text_index -= pattern_index - 1;
            pattern_index = 0;
        }
    }
    if (pattern_index == pattern_size)    return text_index - pattern_index;
    return -1;
}
int KMP()
{
    auto pattern_size = pattern.size();
    auto text_size = text.size();
    auto pattern_index = 0;
    auto text_index = 0;

    while (pattern_index < pattern_size && text_index < text_size)
    {
        if (pattern_index == -1 || pattern[pattern_index] == text[text_index])  //success
        {
            ++text_index;
            ++pattern_index;
        }
        else    pattern_index = next[pattern_index];  //failure
    }
    if (pattern_index == pattern_size)    return text_index - pattern_index;
    return -1;
}

The two pieces of code above are BF Algorithm and MP(KMP) Algorithm.

By observing we can see that the two pieces of code are strikingly similar.

Let’s discover the relationship between them by simulating the process performed by the algorithm.

BF

BF Algorithm is naive matching algorithm. It’s idea is to match the text string and the pattern string bit by bit from the beginning. If the match fails (mismatch), the pattern string is shifted to the right by one bit, and then matched from the beginning.

The above picture shows 1313 times matching.

It can be found that the BF Algorithm can be very efficient in the case of too many mismatches.

MP

The 2 and 4 steps in the BF Algorithm are useless matches. Because pattern[0]=text[0]pattern[0] = text[0] and text[0]text[1]text[0] \neq text[1] , therefore, pattern[0]text[1]pattern[0] \neq text[1]. The second step must match the failure, the fourth step is the same. MP(KMP) Algorithm in order to improve this, do not backtrack the subscript of the text string, only change the subscript of the pattern string, minus the two steps.

The above picture shows 1111 times matching.

We get the first conclusion:

The difference between the MP (KMP) Algorithm and the BF Algorithm is that the pattern string is shifted to the right by one bit each time the BF Algorithm is mismatched, and the MP (KMP) Algorithm is shifted to the right by k(k1,kN)k(k \geq 1, k \in N) bits.

Calculate the number of right shifts

Knowing the difference between MP(KMP) Algorithm and BF Algorithm, the key now is how to calculate the right-shifted kk value.

Let‘s look at three more examples:

example

Marking that PP is the set of all prefixes of the pattern string, SS is the set of all suffixes of the pattern string, for the matched string of the example one P={A,AB,ABC,ABCA,ABCAB}P=\{A,AB,ABC,ABCA,ABCAB\}, S={B,AB,CAB,BCAB,ABCAB}S=\{B,AB,CAB,BCAB,ABCAB\}.

It can be found that the longest same element of the P and S of the matching string is AB, and the previous two matches are to move the pattern string from the prefix AB to the suffix AB.

Since MP(KMP) Algorithm is traversed from left to right, we only need to record the length of the longest same element of the PP and SS of all prefix substrings of the pattern string as l[i]l[i]. And then, after each mismatch, the pattern string is shifted to the right by $Length-l[i-1]-1. Marking $next[i]=l[i]$, and the MP(KMP) Algorithm will come out.

So we get the second conclusion:

MP Algorithm‘s right-shifted kk value equal to the longest same element of the PP and SS of the matched pattern string.

Compute next array

So how do you calculate next[i]next[i]? The answer is still coming from MP(KMP) Algorithm.

We match the pattern string with itself, noting that the string that matches successfully is the longest same element of the PP and SS of the corresponding substring.

self

Therefore, we only need to add the assignment statement to the nextnext array after matching in the MP(KMP) Algorithm program to complete the solution to the nextnext array. (next[0]=1,next[i]=0,i>0next[0]=-1, next[i]=0, i > 0)

int GetNext()
{
    next[0] = -1;
    auto prefix = -1;
    auto suffix = 0;
    while (suffix < pattern_size - 1)
    {
        if (prefix == -1 || pattern[prefix] == pattern[suffix])
        {
            ++prefix;
            ++suffix;
            next[suffix] = next[prefix];  //assignment
        }
        else    prefix = next[prefix];
  }
}

At this point, we have completed all the code of MP Algorithm.

The third conclusion:

Match the pattern string to itself to get the nextnext array.

Complete KMP algorithm

Let‘s look at an example:

KMP

In this example, the second match must fail. Because pattern[i]=pattern[next[i]]pattern[i]=pattern[next[i]], and pattern[i]text[i]pattern[i] \neq text[i], therefore, pattern[next[i]]text[i]pattern[next[i]] \neq text[i].

So we add a judgment to the function to optimize the algorithm.

int GetNext_KMP()
{
    next[0] = -1;
    auto prefix = -1;
    auto suffix = 0;
    while (suffix < pattern_size - 1)
    {
        if (prefix == -1 || pattern[prefix] == pattern[suffix])
        {
            ++prefix;
            ++suffix;
            if (pattern[prefix] != pattern[suffix])    next[suffix] = prefix;
            else    next[suffix] = next[prefix];
        }
        else    prefix = next[prefix];
  }
}

In the end, we get all the code for KMP Algorithm.

Time Complexity

  • BF Algorithm: O(nm)O(nm)
  • MP Algorithm: O(m)+O(n+m)O(m)+O(n+m)
  • KMP Algorithm: O(m)+O(n+m)O(m)+O(n+m)

  • BF Algorithm:Brute Force Algorithm
  • MP Algorithm:Morris-Pratt Algorithm
  • KMP Algorithm:Knuth-Morris-Pratt Algorithm
  • Reference Code
Last modified on November 3rd 2020 4:55:53 pm
Use WSL to Build a Development Environment
© 2020 · YXL
Build with Gatsby