Text Processing in NLP

Overview

Natural Language Processing (NLP) is one of the most important branches of artificial intelligence. It helps computers interpret and respond to human language. Enormous amounts of data flow around us in the form of speech, text, and audio, and this data can be channeled to extract valuable information. In its raw format, however, the data is of little use: we must preprocess it before feeding it to machine learning algorithms.

Introduction

Data in the form of text is widely used by many NLP applications; it is readily available and easy to work with. When you search for anything on Google, your words get autocompleted, right? This is because the algorithm involved has been trained on thousands of words (originally in the form of raw text) for that purpose. Take the example of chatbots: you type your query, the text is processed and pushed through the pipeline, and you get your results back in the form of answers. Sentiment analysis is performed on customer feedback to predict how customers feel, and spam filtering works by processing the text of emails. The applications are endless, but text preprocessing in NLP is crucial before training on the data.

Significance of Text Pre-Processing in NLP

Text preprocessing in NLP is the process of cleaning raw text data by removing noise, such as punctuation, emojis, and overly common words, to make it ready for model training. It is very important to remove unhelpful parts from our text, and the technique comes in especially handy with user-generated data such as tweets. The sole purpose of text preprocessing in NLP is to improve both the efficiency and the accuracy of our machine learning model. Consider the tweet: "I am just looooving NLP!!! with @john #NLPdays". If this tweet is fed into a model that determines its positivity or negativity, the model has to work through all the individual words and other characters, whereas training with the base word "love" would suffice. Text preprocessing would remove words such as "I" and "am", the punctuation, the Twitter handle, and the hashtag, and feed the base word "love" into the model.

What is Text Analytics?

Raw text data is unstructured. Text analytics converts this data into a quantitative form to extract meaningful insights and information, and it also involves visualizing the extracted information in charts and graphs. In text analysis, we process our text for information such as sentiment, the topics involved, etc. Text analytics uses this information to obtain valuable insights, identify patterns, and provide a solution or a suggestion for improvement. Approximately 500 million tweets are posted in a single day; text analytics helps companies extract valuable information from them, informing business decisions around customer satisfaction, reputation, and product issues. The biggest advantage of text analytics is that it scales: we can derive quantitative insights from a large amount of unstructured data, providing a huge boost in confidence when making important decisions.

Data Preprocessing

There are various steps involved in text processing in NLP. The most important steps are:

  • Word Tokenization: This is the first step in any NLP process that uses text data. Tokenization is a mandatory step, which simplifies things for our machine learning model. It is the process of breaking down a piece of text into individual components or smaller units called tokens. The ultimate goal of tokenization is to process the raw text data and create a vocabulary from it.

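For illustration, here is a minimal sketch of tokenization using plain Python's split; production pipelines normally rely on a library tokenizer (see the NLTK section below):

```python
# A toy tokenizer: break the raw text into tokens on whitespace.
text = "I am just loving NLP with John"
tokens = text.split()
print(tokens)
# ['I', 'am', 'just', 'loving', 'NLP', 'with', 'John']
```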

  • Lower casing: This step reduces complexity. We convert the text data into the same case, preferably lowercase, so that we don't have to work with both cases.

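A minimal sketch with Python's built-in str.lower:

```python
# Map every character to lowercase so "NLP" and "nlp" become the same token.
text = "I am just LOVING NLP"
print(text.lower())
# i am just loving nlp
```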

  • Punctuation removal: In this step, all the punctuation marks present in the text are removed.

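A minimal sketch using Python's string module (this strips ASCII punctuation only; emojis and other Unicode symbols need extra handling):

```python
import string

# Build a translation table that deletes every ASCII punctuation character.
text = "I am just loving NLP!!!"
clean = text.translate(str.maketrans("", "", string.punctuation))
print(clean)
# I am just loving NLP
```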

  • Stop word removal: The most commonly used words in a language are called stopwords. They contribute very little to predictions and add little analytical value, so removing them makes it easier for our models to train on the text data. We can use the gensim library in Python to remove stopwords, as sketched below.

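A minimal sketch with gensim's built-in remove_stopwords helper (the exact words removed depend on gensim's stopword list):

```python
from gensim.parsing.preprocessing import remove_stopwords

# remove_stopwords drops any word found in gensim's English stopword list.
text = "i am just loving nlp with john"
print(remove_stopwords(text))
# e.g. "loving nlp john" (exact output depends on gensim's list)
```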

  • Stemming: Stemming, or text standardization, converts each word into its root/base form. For example, the word "faster" becomes "fast". The drawback of stemming is that it ignores the semantic meaning of words and simply strips them of their prefixes and suffixes: the word "laziness" will be converted to "lazi", not "lazy".

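To make the drawback concrete, here is a deliberately naive suffix-stripping stemmer; real stemmers such as Porter's apply far more careful rules (see the NLTK section below):

```python
# A deliberately naive stemmer: blindly strip common suffixes.
# It illustrates why stemming can mangle words -- semantics are ignored.
SUFFIXES = ("ness", "ing", "er", "ly", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("faster"))    # fast
print(naive_stem("laziness"))  # lazi  (not "lazy" -- meaning is ignored)
```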

  • Lemmatization: This process overcomes the drawback of stemming and makes sure that a word does not lose its meaning. Lemmatization uses a pre-defined dictionary to look up each word and reduces it to a meaningful base form (its lemma) based on the context in which it appears.

NLTK: An Overview

The Natural Language Toolkit (NLTK) is the most widely used platform for building Python programs that work with human language data. The library simplifies the various text preprocessing steps enormously and provides a wide range of algorithms for Natural Language Processing. We can install NLTK using:

  • pip
  • Anaconda
  • Jupyter Notebook
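
For example (the exact commands can vary with your environment):

```bash
pip install nltk                  # with pip
conda install -c anaconda nltk    # with Anaconda
```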

The aforementioned text preprocessing steps can be easily implemented using the NLTK library.

  • Tokenization: Texts can be tokenized with the help of nltk.word_tokenize.

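A minimal sketch (the punkt tokenizer data must be downloaded once; newer NLTK releases may ask for "punkt_tab" instead):

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize

text = "I am just loving NLP!!!"
print(word_tokenize(text))
# ['I', 'am', 'just', 'loving', 'NLP', '!', '!', '!']
```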

  • Lower casing: We convert a given piece of text to lowercase with the help of text.lower(), as shown earlier.

  • Stop word removal: NLTK ships a corpus of English stopwords that can be used to filter them out. We use nltk.corpus.stopwords.words('english') for this purpose.

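A minimal sketch, filtering tokens against NLTK's English stopword list:

```python
import nltk
nltk.download("stopwords")  # one-time download of the stopword corpus

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["i", "am", "just", "loving", "nlp"]
print([t for t in tokens if t not in stop_words])
# ['loving', 'nlp']
```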

  • Stemming: For stemming, we generally use the PorterStemmer from the NLTK library.

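A minimal sketch with the PorterStemmer; note how "laziness" is clipped to "lazi", the drawback discussed earlier:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["loving", "laziness", "studies"]:
    print(word, "->", stemmer.stem(word))
# loving -> love
# laziness -> lazi
# studies -> studi
```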

  • Lemmatization: NLTK uses the WordNetLemmatizer for the purpose of lemmatization.

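A minimal sketch (the WordNet data must be downloaded once; the pos argument tells the lemmatizer the word's part of speech, noun by default):

```python
import nltk
nltk.download("wordnet")  # one-time download of the WordNet dictionary

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # study
print(lemmatizer.lemmatize("loving", pos="v"))  # love
print(lemmatizer.lemmatize("laziness"))         # laziness (already a valid lemma)
```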

Zipf's Law in NLP

According to Zipf's law, the frequency of a token in a text depends on its position in the frequency-sorted list of tokens: the frequency is inversely proportional to the rank of the word.

$f(r, \alpha) \propto 1 / r^{\alpha}$

Here f is the frequency, α is close to 1, and r is the rank of the word.

[Image: English words ranked by frequency, from wordcount.org]

The image is from wordcount.org, which ranks 86,800 of the most frequent English words in order of how common they are; "the" is the most commonly used English word.

NLP leans on Zipf's law for any elementary task that requires finding how words are related to each other. The law needs a large corpus to hold properly.
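
A quick way to see the law in action is to count token frequencies in any large text: with α close to 1, rank × frequency should stay roughly constant. A minimal sketch (corpus.txt is a placeholder for any large plain-text file):

```python
from collections import Counter

# Count word frequencies in a large text file (corpus.txt is a placeholder).
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(tokens)

# Under Zipf's law with alpha close to 1, rank * frequency is roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)
```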

What is Unicode?

Unicode is a standard that assigns a unique code point to every character. An encoding system represents these code points electronically; common ones are UTF-8, UTF-16, and UTF-32, which differ in how many bytes they use to store each character.
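
For example, Python can show how many bytes each encoding spends per character (the -le variants are used here to skip the byte-order mark):

```python
# The same character can need a different number of bytes in each encoding.
for ch in ("h", "é", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, and 4 bytes
          len(ch.encode("utf-16-le")),  # 2, 2, and 4 bytes (surrogate pair)
          len(ch.encode("utf-32-le")))  # always 4 bytes
```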

What is Unicode Normalization?

This variant takes the concept of text normalization to another level. Two pieces of text that look alike to a human reader may be stored as different Unicode characters, and the computer and the machine learning model will then treat them as different strings. There are two situations here:

  • The characters look different but carry the same meaning, for example, stylized or circled letters versus their plain forms.
  • The characters look the same and mean the same but are stored as different code-point sequences, for example, an accented letter stored as a single code point versus a plain letter followed by a combining accent.

The first situation is compatibility equivalence, and the second is canonical equivalence. Our model should interpret both ⓗⓔⓛⓛⓞ ⓦⓞⓡⓛⓓ and "hello world" as the same text.

Unicode normalization provides a solution to both canonical and compatibility equivalence issues.
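
In Python, the standard library's unicodedata module implements these normalization forms; a minimal sketch:

```python
import unicodedata

# NFKC maps compatibility-equivalent characters (circled letters) to plain ones.
fancy = "ⓗⓔⓛⓛⓞ ⓦⓞⓡⓛⓓ"
print(unicodedata.normalize("NFKC", fancy))  # hello world

# NFC merges canonically equivalent sequences: "e" + combining accent == "é".
composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```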

Conclusion

Text preprocessing is the most important step in any NLP task; without it, the ship of NLP would be rudderless. The key takeaways from this article are:

  • The process of text preprocessing removes all the noise from our text data to make it ready for text representation and to be trained for the machine learning model.
  • The key steps in data preprocessing are tokenization, lower casing, punctuation removal, stopword removal, stemming, and lemmatization.
  • We then moved on to the NLTK library, which makes the task of text preprocessing a breeze.
  • Finally, we saw the intuition behind Zipf's law in NLP and the importance of Unicode normalization.