Boyer Moore Algorithm - Scalar Topics

Problem Statement

We are provided with an input pattern and a text (or string), and we need to search and print all the occurrences of the pattern in the string.

Note: Use Boyer Moore's algorithm for searching the pattern in the string.

The first input is the size of the actual string i.e. n.
The second input is the size of the pattern string i.e. m.
The third input contains the actual string.
The fourth input contains the pattern to be searched.

Refer to the Example and Explanation sections for more details and the Approach section to understand the working of the Boyer Moore algorithm.

Example

Let us look at some of the examples provided to search for a pattern searching using the Boyer-Mooreoore algorithm.

Example 1: text = "Scaler Topics" pattern = "Topics"

Output: Pattern found at index: 7.

Example 2: text = "he is doing what he was told to do" pattern = "he"

Output: Pattern found at index: 0. Pattern found at index: 17.

Example Explanation

Before getting into the explanation of how to search or find the occurrences of a specific pattern in a string, let us first get a brief introduction to strings.

A String is an immutable data type that is used to store the sequence of characters. Strings are one of the most widely used data types of any programming language. A string can be easily created using quotes (either single quotes or double quotes).

Now, for searching the pattern in the string, we can use the Boyer Moore pattern searching algorithm (or B-M algorithm). The intuition of the Boyer Moore algorithm is very simple, two pointers are aligned at the 0th index of the text string and the character string. The pattern string is then compared character by character with the current portion of the text string, beginning with the rightmost character.

Now, if a character does not match then the Boyer-Moore algorithm shifts the characters using two strategies simultaneously. So, we can say that the Boyer Moore algorithm is a combination of two approaches namely:

Bad Character Heuristic, and
Good Suffix Heuristic.

Refer to the following sections to understand both approaches.

Now, coming back to the example, the provided text is: text = "Scaler Topics" and the pattern to be searched is: pattern = "Topics". Now, in this example, we would start the searching with the first character i.e. index-0, and the pattern is found from index-7 to index-12, but we need to print the index where the pattern occurrence starts. So, we would print index- $7$ . Similarly in the second example: text = "he is doing what he was told to do." pattern = "he" The first occurrence of he is found on index-0 and then again on index-17. So, we printed both indices.

Constraints

$- 1 ≤ |s| ≤ 10^5$

Here, s denotes the pattern string and text string.

In some problems, you may find the number of test cases represented by t So, we only need to call the BoyerMooreAlgorithm function `to-times.

The problem is to search a pattern searching using the Boyer-Moore algorithm.

Approach 1: Bad Character Heuristic

As we have discussed earlier that the Boyer Moore algorithm matches the pattern string with the provided text. Now, if a match is found then it returns the pattern's starting index. Otherwise, there may arise two cases:

1. When the Mismatched Character of Input Text is Present in the Pattern

In such cases, we call that character a bad character. So, when a bad character is found, we will shift the pattern until it gets aligned with the mismatched character of the text.

For example, let's say that the pattern given is: RPCRQ and the input text is: AYRRQMGRQRQ. Now, we start comparing the pattern string with the input string and we have found a mismatch between the character R of the text (the bad character) and the character C of the pattern. So, we will shift the pattern string until the character R of the pattern matches the character R of the text string.

Refer to the image provided below for better visualization. mismatch between the character shift till match

2. When the Mismatched Character of Input Text is Not Present in the Pattern

Let us take the same example as above and we have found that the mismatched character of input text is not present in the pattern.

Comparing from right to left

There is a mismatch between G and Q, but G is not at all present in the entire pattern string. So, we would not compare the entire pattern and we would easily shift the pattern until the character G on the input text as: Shifted till after G

Let us implement the above algorithm for better understanding.

C++ code:

Java code:

Python code:

Output:

Complexity Analysis

In this approach, we are traversing the pattern string for each portion of the text string.

Time Complexity

The time complexity of the algorithm comes out to be O (n x m), where n and m are sizes of text and pattern string respectively.

Space Complexity

Although we need to create the array for the preprocessing function the size of that array is fixed i.e. 256. We are not using any extra space apart from the preprocessing, so the space complexity of the algorithm comes out to be (O(1)).

Approach 2: Good Suffix Heuristic

Another approach for searching the pattern using the Boyer-Moore algorithm is to detect the good suffix and then do the processing. The main idea is to shift more efficiently when a mismatch occurs by aligning the overlapping parts of the pattern and the text string together.

Let us learn the algorithm with an example for better visualization. We start searching from the right-hand side and suppose that we have found a part of the pattern matching. Refer to the image (green color marks match).

start checking from here

We will continue the search and as soon as we have found a mismatch, we would stop. the portion of the pattern that matched is called Good String. And a mismatch means that the current location in the text is not the suitable starting place for the target pattern.

So, we should shift the pattern to find the next possible location for checking. We can use the good suffix for efficient shifting. There can be two cases if we use a good suffix for shifting.

When the good suffix exists somewhere in the pattern itself:

good suffix

As we can see that the good string is also present somewhere else. So, shift the pattern so that the good suffix parts are aligned as:

When the partial good suffix exists as a prefix of the pattern:

As we can that there is a mismatch but we cannot find the exact match of the good string. However, if we can find the suffix of the suffix (marked in blue) in the prefix of our pattern then we can also take thfrontmp of this match in shifting. good suffix exists as a mismatch

In such a case, shift the pattern so that the partial suffix is in alignment with the prefix of the pattern. good suffix exists as a pattern

So, we can add the virtual green part rest of the good string) to the blue partial good suffix to make it a virtually complete good suffix starting at the location before the pattern.

Let us implement the above algorithm for better understanding.

C++ code:

Java code

Python code:

Output:

Complexity Analysis

In this approach, we are traversing the pattern string and preprocessing the string to handle the strong suffix case, and prefix case.

Time Complexity

The time complexity of the algorithm comes out to be O (n x m), where n and m are sizes of text and pattern string respectively.

Space Complexity

We have used two arrays of sizes equal to the size of the pattern. So, the time complexity comes out to be [O(m) + O(m)] i.e. O(n) where n = (m+m) or twice the size of the pattern.

Conclusion

A String is an immutable data type that is used to store the sequence of characters. Strings are one of the most widely used data types of any programming language.
For searching the pattern in the string, we can use the Boyer-Moore pattern searching algorithm (or B-M algorithm).
The intuition of the Boyer-Moore algorithm is very simple, two pointers are aligned at the 0th index of the text string and the character string. The pattern string is then compared character by character with the current portion of the text string, beginning with the rightmost character.
When a character does not match then the Boyer Moore algorithm shifts the characters using two strategies simultaneously:
- Bad Character Heuristic, and
- Good Suffix Heuristic.
In the Bad Character Heuristic approach, we are traversing the pattern string for each portion of the text string.
In the Good Suffix Heuristic approach, we are traversing the pattern string and preprocessing the string to handle the strong suffix case, and prefix case.
The time complexity of the Bad Character Heuristic approach, comes out to be O (nm), where n and m are sizes of text and pattern string respectively. Apart from that, we are using an array of fixed size (257) so the space complexity of the algorithm comes out to be (O(1)).
The time complexity of the Good Suffix Heuristic approach, comes out to be O (nm), where n and m are sizes of text and pattern string respectively. We have used two arrays of sizes equal to the size of the pattern. So, the time complexity comes out to be [O(m)+ O(m)] i.e. O(n) where n = (m+m).