Fuzzy Search in NLP - Scaler Topics

Overview

Fuzzy search lets a computer tolerate the kind of imperfect input that humans handle easily. Formally, we can define fuzzy search in NLP as a type of query that can deal with misspellings, typos, and similar errors in the input string and still return the desired output. It relies on fuzzy string matching, a searching technique that finds strings matching a given pattern approximately rather than exactly, so matching words can be found even when the input contains misspelled words.

Pre-requisites

Before learning about the Fuzzy Search in NLP, let us first learn some basics about NLP itself.

NLP stands for Natural Language Processing. An NLP system analyzes (and sometimes generates) text, and a trained NLP model predicts the required output for a given input.
NLP is a subfield of Artificial Intelligence, and modern NLP systems rely heavily on Deep Learning.
In basic terms, NLP is a computer program's ability to process and understand human language.
In a library such as spaCy, the NLP pipeline first converts the input text into a sequence of tokens (held in a Doc object) and then performs several operations on that Doc object.
A typical NLP pipeline consists of stages such as the tokenizer, tagger, lemmatizer, parser, and entity recognizer. The tokenizer turns the text into a Doc; every later stage takes the Doc as input and returns the processed Doc.
Every stage makes its own changes to the Doc object and passes it to the next stage of the pipeline.

Introduction

Humans can usually recognize a word even when its spelling is slightly wrong, whereas a machine normally needs the correct spelling to find it. Fuzzy search in NLP narrows this gap by helping machines guess the intended word even when the spelling is a bit incorrect.

We learn a language from birth, so we develop an instinct for its words. Computer systems are instead trained on real-world knowledge and good data sets. This training helps, but it does not reach human-level accuracy and analysis. Fuzzy search in NLP is one way to let machines still produce good output from imperfect input.

So what exactly is fuzzy search? It is a way to make a computer handle imperfect input more like a human would. Fuzzy search in NLP has two main parts:

Fuzzy Logic: Fuzzy logic is a computational intelligence technique that works with degrees of truth rather than strict true/false values, which helps in making effective decisions.
Example Data Sets: We provide various real-world data sets to approach human-level accuracy. We can also use existing NLP applications such as Google Translate, Google Search, MIT START, etc.

Formally, we can define fuzzy search in NLP as a type of query that can deal with misspellings, typos, and similar errors in the input string and still return the desired output. In the next sections, we will learn how it works and the logic behind it.

String Matching

String matching means comparing two strings to check whether they are similar. Fuzzy string matching finds strings that match a given pattern approximately rather than exactly, so it can find matching words even when the input contains misspelled words. It is also used when the user types only part of a word. For this reason, it is also known as approximate string matching.
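
The difference between exact and approximate matching can be sketched with Python's standard library. This is only an illustration: it uses difflib.get_close_matches rather than fuzzywuzzy, and the vocabulary, the closest_word helper, and the 0.6 cutoff are assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary, used only for illustration.
VOCABULARY = ["apple", "banana", "orange", "grape"]

def closest_word(query, vocabulary=VOCABULARY):
    """Return the vocabulary word most similar to query, or None."""
    # Lower-case the query so that "Aple" and "aple" behave the same.
    matches = get_close_matches(query.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

print("aple" in VOCABULARY)   # exact matching fails for a misspelled word
print(closest_word("aple"))   # misspelled input
print(closest_word("appl"))   # partial input
print(closest_word("xyz"))    # nothing is close enough, so None
```

Exact lookup rejects the misspelled and partial inputs, while the approximate lookup still maps both of them to apple.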

Let us now look at the various applications of the fuzzy string matching technique.

Spell checkers and spelling-error detectors such as Grammarly.
Detecting duplicate records in a database.
Searching for matching products on e-commerce websites.

We will see a practical demonstration of fuzzy string matching with the Python library fuzzywuzzy in a later section.

The Concept of Fuzzy Search

Let us now discuss fuzzy search in NLP in detail. In a fuzzy search query, we scan for words with a similar composition and then expand the search to find all near matches of the query.

In the background, when a fuzzy search is made, the search engine builds a graph of terms similar to the terms in the query. The graph usually expands to at most 50 variants, and these expansions cover both correct and incorrect spellings. After processing the graph, the search engine returns the most relevant result as the top match. The search engine uses dictionaries both to build the graph and to match the search terms.

Fuzzy search in NLP internally uses the lexical analysis stage of the NLP pipeline. The query input is first added to a query tree, which is then expanded into the graph. The construction of the graph therefore starts as soon as we provide the search input.

Because a letter can be upper or lower case, the input is first converted to lower case, and both the search and the graph construction work on lower-case text only. Building and processing the graph takes time, so fuzzy search is slower than exact search. The time taken also depends on the length of the word and the type of query.

An example of fuzzy search in NLP is looking up the word search. If we type serach, serch, or sarch, the search engine can build the graph and still detect the intended word, i.e. search.
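
The expansion step can be sketched as follows. This is a simplified illustration, not how any particular search engine works: it assumes Levenshtein edit distance as the measure of similarity, builds a flat candidate list instead of a graph, and uses a hypothetical DICTIONARY, a hypothetical max_edits of 2, and a limit of 50 that mirrors the figure mentioned above.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def expand_query(query, dictionary, max_edits=2, limit=50):
    """Return dictionary terms within max_edits of the lower-cased query."""
    query = query.lower()  # matching is done in lower case only
    scored = []
    for term in dictionary:
        distance = edit_distance(query, term.lower())
        if distance <= max_edits:
            scored.append((distance, term))
    scored.sort()  # closest terms first
    return [term for _, term in scored[:limit]]

# Hypothetical dictionary, used only for illustration.
DICTIONARY = ["search", "church", "seek", "banana", "research"]

for typo in ["serach", "serch", "Sarch"]:
    print(typo, "->", expand_query(typo, DICTIONARY))
```

With this small dictionary, each of the three misspellings expands only to search, because every other term is more than two edits away.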

String Matching using Fuzzy Search

Let us now perform string matching with the help of fuzzy search in NLP. For example, if we want to search for the word apple, we might type aple, appl, apples, etc. For the demonstration we will use the Python library fuzzywuzzy. Since fuzzywuzzy is an external library, we first need to install it on our system using the command pip install fuzzywuzzy.

Let us now perform various operations for more clarity.

Ratio

The ratio() function computes a similarity score between two strings. We pass in two texts, and it returns an integer from 0 to 100 that tells us how similar they are, where 100 means the strings are identical. ratio() compares the strings exactly as given: it does not tokenize them, convert them to lowercase, or remove punctuation, so differences in case and punctuation lower the score.
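
A minimal sketch of this kind of score, assuming the standard library's difflib.SequenceMatcher as the matcher (fuzzywuzzy can fall back on it when python-Levenshtein is not installed). The name simple_ratio is hypothetical; it is a stand-in for illustration, not the library function itself.

```python
from difflib import SequenceMatcher

def simple_ratio(first, second):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())

print(simple_ratio("this is a test", "this is a test"))    # identical strings: 100
print(simple_ratio("this is a test", "this is a test!"))   # one extra character: 97
print(simple_ratio("Apple", "apple"))                      # case matters, so below 100

# With fuzzywuzzy installed, the corresponding call would be:
#   from fuzzywuzzy import fuzz
#   fuzz.ratio("this is a test", "this is a test!")
```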

Partial Ratio

The partial_ratio() function handles the case where one string may appear inside the other. If the shorter string has length m and the longer string has length n, the function looks for the substring of length m in the longer string that best matches the shorter string and returns the score of that best match. Like ratio(), it works on the strings as given, without converting them to lowercase or removing punctuation.
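
The idea can be sketched with a brute-force sliding window. This is a simplification under stated assumptions: it tries every window of length m, whereas the library is more selective about which windows it scores, and the names simple_ratio and simple_partial_ratio are hypothetical.

```python
from difflib import SequenceMatcher

def simple_ratio(first, second):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())

def simple_partial_ratio(first, second):
    """Best score of the shorter string against any same-length window."""
    shorter, longer = sorted((first, second), key=len)
    m = len(shorter)
    if m == 0:
        return 0
    best = 0
    # Slide a window of length m over the longer string (length n).
    for start in range(len(longer) - m + 1):
        window = longer[start:start + m]
        best = max(best, simple_ratio(shorter, window))
    return best

print(simple_partial_ratio("this is a test", "this is a test!"))  # 100
print(simple_partial_ratio("apple", "pineapple pie"))             # 100
```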

A partial ratio of 100 means that the shorter string is found completely within the longer string.

Token Sort Ratio

The token_sort_ratio() function builds on ratio(). It is used when the two strings contain the same words but in a different order. The function tokenizes each string, converts the tokens to lowercase and removes punctuation, sorts the tokens in alphabetical order, and joins them back together. It then internally calls ratio() on the two sorted strings to get the final score. Because of this extra work, it is comparatively slower than the previous two functions.
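
The steps above can be sketched as follows. The helper names are hypothetical, the normalization rule (keep only lower-case letters and digits) is an assumption that approximates the preprocessing described above, and difflib.SequenceMatcher again stands in for the underlying ratio.

```python
import re
from difflib import SequenceMatcher

def simple_ratio(first, second):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())

def normalize_tokens(text):
    """Lower-case the text, drop punctuation, and split it into tokens."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def simple_token_sort_ratio(first, second):
    """Sort the tokens of each string, rejoin them, then compare."""
    sorted_first = " ".join(sorted(normalize_tokens(first)))
    sorted_second = " ".join(sorted(normalize_tokens(second)))
    return simple_ratio(sorted_first, sorted_second)

a = "fuzzy wuzzy was a bear"
b = "wuzzy fuzzy was a bear"
print(simple_ratio(a, b))             # plain comparison is penalized by word order
print(simple_token_sort_ratio(a, b))  # 100 once the tokens are sorted
```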

A token sort ratio of 100 means that the two strings contain the same words, regardless of the order in which they appear.

Token Set Ratio

The token_set_ratio() function is used when the strings share the same words but differ in order and in length, for example when one string repeats a word or contains extra words. Like token_sort_ratio(), it tokenizes and normalizes both strings, but it then treats the tokens as sets instead of simply comparing the two sorted strings. It separates the tokens common to both strings (the intersection) from the tokens unique to each string, builds sorted strings from these groups, internally calls ratio() on the resulting pairs, and returns the highest score.
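
A sketch of the set-based comparison, under the same assumptions as before: hypothetical helper names, a simple normalization rule, and difflib.SequenceMatcher standing in for the underlying ratio. It follows the steps described above and is not a copy of the library code.

```python
import re
from difflib import SequenceMatcher

def simple_ratio(first, second):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())

def normalize_tokens(text):
    """Lower-case the text, drop punctuation, and split it into tokens."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def simple_token_set_ratio(first, second):
    """Compare the shared tokens against each string's full token set."""
    tokens_first = set(normalize_tokens(first))
    tokens_second = set(normalize_tokens(second))
    common = " ".join(sorted(tokens_first & tokens_second))
    only_first = " ".join(sorted(tokens_first - tokens_second))
    only_second = " ".join(sorted(tokens_second - tokens_first))
    combined_first = (common + " " + only_first).strip()
    combined_second = (common + " " + only_second).strip()
    # The best of the three pairwise comparisons is the final score.
    return max(simple_ratio(common, combined_first),
               simple_ratio(common, combined_second),
               simple_ratio(combined_first, combined_second))

print(simple_token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
print(simple_token_set_ratio("apple iphone", "iPhone Apple 64GB"))           # 100
```

In the second call, every word of the shorter string appears in the longer one, so the shared tokens match the shorter string exactly and the score is 100 despite the extra word.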

A token set ratio of 100 means that all the words of one string are found in the other string, even if the two strings are of different lengths.

Fuzzy Process Extract

Process extraction finds the strings most similar to a query within a list of candidate strings.

FuzzyWuzzy also comes with a handy module, process, that returns strings together with their similarity scores from a list of choices. All we need to do is call process.extract() with the query and the list of choices.

Similarly, we can use the process module to get only the single string with the highest similarity score by calling process.extractOne().
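
The ranking behind both calls can be sketched as below. The functions extract and extract_one here are hypothetical stand-ins written with the standard library, and they score with a plain difflib ratio, whereas fuzzywuzzy's process module uses its own weighted scorer by default. The CHOICES list is made up for the example.

```python
from difflib import SequenceMatcher

def simple_ratio(first, second):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())

def extract(query, choices, limit=5):
    """Return up to limit (choice, score) pairs, best score first."""
    scored = [(choice, simple_ratio(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

def extract_one(query, choices):
    """Return only the best (choice, score) pair, or None if no choices."""
    results = extract(query, choices, limit=1)
    return results[0] if results else None

# Hypothetical list of choices, used only for illustration.
CHOICES = ["apple", "apples", "banana", "grape"]

for choice, score in extract("aple", CHOICES, limit=3):
    print(choice, score)
print(extract_one("aple", CHOICES))

# With fuzzywuzzy installed, the corresponding calls would be:
#   from fuzzywuzzy import process
#   process.extract("aple", CHOICES, limit=3)
#   process.extractOne("aple", CHOICES)
```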

Conclusion

Formally, we can define fuzzy search in NLP as a type of query that can deal with misspellings, typos, and similar errors in the input string and still return the desired output. It uses fuzzy string matching to find strings that match a given pattern approximately.
An example of fuzzy search in NLP is looking up the word search. If we type serach, serch, or sarch, the search engine can build a graph of similar terms and detect the intended word, i.e. search.
Fuzzy search is a way to make a computer handle imperfect input more like a human would. Fuzzy string matching is the searching technique behind it, and it finds matching words even if the input contains misspelled words.
Fuzzy search in NLP has two main parts. The first is fuzzy logic, a computational intelligence technique that helps in making effective decisions. The second is the data set: we provide various real-world data sets to approach human-level accuracy, and we can use existing NLP applications such as Google Translate, Google Search, MIT START, etc.
Fuzzy string matching is also used when the user types only part of a word. For this reason, it is also known as approximate string matching.
Some applications of fuzzy search in NLP are spell checkers and spelling-error detectors such as Grammarly, detecting duplicate records in a database, and searching for matching products on e-commerce websites.