 Image credit: KiTA

# String Matching ①

## the KMP Algorithm

Mon   18 Mar 2019   00:48:50 1051 words *This article doesn't discuss the ‘why’ of the KMP Algorithm. It only explains the implementation of the KMP algorithm in a simple way. For further study, please refer to other materials.*

Firstly, let’s look at these two codes:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  int BF() { auto pattern_size = pattern.size(); auto text_size = text.size(); auto pattern_index = 0; auto text_index = 0; while (pattern_index < pattern_size && text_index < text_size) { if (pattern[pattern_index] == text[text_index]) //success { ++text_index; ++pattern_index; } else //failure { text_index -= pattern_index - 1; pattern_index = 0; } } if (pattern_index == pattern_size) return text_index - pattern_index; return -1; } 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19  int KMP() { auto pattern_size = pattern.size(); auto text_size = text.size(); auto pattern_index = 0; auto text_index = 0; while (pattern_index < pattern_size && text_index < text_size) { if (pattern_index == -1 || pattern[pattern_index] == text[text_index]) //success { ++text_index; ++pattern_index; } else pattern_index = next[pattern_index]; //failure } if (pattern_index == pattern_size) return text_index - pattern_index; return -1; } 

The two pieces of code above are BF Algorithm and MP(KMP) Algorithm.

By observing we can see that the two pieces of code are strikingly similar.

Let’s discover the relationship between them by simulating the process performed by the algorithm. BF Algorithm is naive matching algorithm. It’s idea is to match the text string and the pattern string bit by bit from the beginning. If the match fails (mismatch), the pattern string is shifted to the right by one bit, and then matched from the beginning.

The above picture shows $13$ times matching.

It can be found that the BF Algorithm can be very efficient in the case of too many mismatches. The 2 and 4 steps in the BF Algorithm are useless matches. Because $pattern = text$ and $text \not= text$ , therefore, $pattern \not= text$. The second step must match the failure, the fourth step is the same. MP(KMP) Algorithm in order to improve this, do not backtrack the subscript of the text string, only change the subscript of the pattern string, minus the two steps.

The above picture shows $11​$ times matching.

We get the first conclusion:

The difference between the MP (KMP) Algorithm and the BF Algorithm is that the pattern string is shifted to the right by one bit each time the BF Algorithm is mismatched, and the MP (KMP) Algorithm is shifted to the right by $k(k \geq 1, k \in N)​$ bits.

Knowing the difference between MP(KMP) Algorithm and BF Algorithm, the key now is how to calculate the right-shifted $k​$ value.

Let‘s look at three more examples: Marking that $P$ is the set of all prefixes of the pattern string, $S$ is the set of all suffixes of the pattern string, for the matched string of the example one $P = \lbrace A, AB, ABC, ABCA, ABCAB \rbrace$,$S = \lbrace B, AB, CAB, BCAB, ABCAB \rbrace$.

It can be found that the longest same element of the $P$ and $S$ of the matching string is $AB$, and the previous two matches are to move the pattern string from the prefix $AB$ to the suffix $AB$.

Since MP(KMP) Algorithm is traversed from left to right, we only need to record the length of the longest same element of the $P$ and $S$ of all prefix substrings of the pattern string as $l[i]$. And then, after each mismatch, the pattern string is shifted to the right by $Length-l[i-1]-1$. Marking $next[i]=l[i]$, and the MP(KMP) Algorithm will come out.

So we get the second conclusion:

MP Algorithm‘s right-shifted $k$ value equal to the longest same element of the $P$ and $S$ of the matched pattern string.

So how do you calculate $next[i]$? The answer is still coming from MP(KMP) Algorithm.

We match the pattern string with itself, noting that the string that matches successfully is the longest same element of the $P$ and $S$ of the corresponding substring. Therefore, we only need to add the assignment statement to the $next$ array after matching in the MP(KMP) Algorithm program to complete the solution to the $next$ array. (next=-1, next[i]=0, i > 0$)   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  int GetNext() { next = -1; auto prefix = -1; auto suffix = 0; while (suffix < pattern_size - 1) { if (prefix == -1 || pattern[prefix] == pattern[suffix]) { ++prefix; ++suffix; next[suffix] = next[prefix]; //assignment } else prefix = next[prefix]; } }  At this point, we have completed all the code of MP Algorithm. The third conclusion: Match the pattern string to itself to get the$next$array. Let‘s look at an example: In this example, the second match must fail. Because$pattern[i] = pattern[next[i]]$, and$pattern[i] \not = text[i]$, therefore,$pattern[next[i]] \not = text[i]$. So we add a judgment to the function to optimize the algorithm.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17  int GetNext_KMP() { next = -1; auto prefix = -1; auto suffix = 0; while (suffix < pattern_size - 1) { if (prefix == -1 || pattern[prefix] == pattern[suffix]) { ++prefix; ++suffix; if (pattern[prefix] != pattern[suffix]) next[suffix] = prefix; else next[suffix] = next[prefix]; } else prefix = next[prefix]; } }  In the end, we get all the code for KMP Algorithm. ### Time Complexity • BF Algorithm:$O(nm)​$• MP Algorithm:$O(m)+O(n+m)$• KMP Algorithm:$O(m)+O(n+m)\$

• BF Algorithm:Brute Force Algorithm
• MP Algorithm:Morris-Pratt Algorithm
• KMP Algorithm:Knuth-Morris-Pratt Algorithm
• Reference Code 