spaCy’s Statistical Models


Overview

The computer's ability to understand and interpret human language is termed Natural Language Processing (NLP). This branch of machine learning is extremely useful and has been applied in numerous ways. Several Python libraries deal with NLP tasks; one such library is spaCy, and this article revolves around it.

Introduction

Unstructured textual data is produced at a large scale, and it is important to process it and derive insights from it. To do so, we need to convert the data into a form the computer understands, and NLP is the way to do that. spaCy is a free, open-source library for Natural Language Processing in Python with many built-in capabilities. spaCy's popularity is growing steadily thanks to the set of features it offers, its ease of use, and the fact that the library is always kept up to date.

What are Statistical Models?

The process of applying statistical analysis to a dataset is called statistical modeling. A statistical model is a mathematical representation (or model) of observed data.

spaCy's statistical models are the power engines of the library. These models enable spaCy to perform several NLP tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.

List of Statistical Models in spaCy

  • en_core_web_sm: English multi-task CNN trained on OntoNotes. Size — 11 MB
  • en_core_web_md: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size — 91 MB
  • en_core_web_lg: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size — 789 MB

We load a spaCy model using spacy.load("model_name"), as shown in the sketch below.
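A minimal sketch using the small English model (any of the packages above works the same way; the model must be downloaded once before it can be loaded):

```python
# One-time setup from the command line:
#   pip install spacy
#   python -m spacy download en_core_web_sm

import spacy

# Load the installed package and get a Language object back.
nlp = spacy.load("en_core_web_sm")
```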

How to Use the Statistical Models?

Using Your Model

We will perform part-of-speech (POS) tagging using spaCy.

To use spaCy for your model, follow the steps below (a runnable sketch follows the list):

  • Install spaCy from your command line.
  • Import the package.
  • Create the spaCy object.
  • Print the tags using the spaCy object.
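A minimal sketch of these steps, assuming the en_core_web_sm package is already installed; the sample sentence is our own:

```python
# Step 2: import the package.
import spacy

# Step 3: create the spaCy object by loading a statistical model.
nlp = spacy.load("en_core_web_sm")

# Step 4: print the tags. Processing a string returns a Doc object,
# and each Token in it carries a part-of-speech tag.
doc = nlp("Apple is looking at buying a U.K. startup.")
for token in doc:
    print(token.text, token.pos_)
```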

Output:
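The exact tags can vary slightly across model versions; with a recent en_core_web_sm, the loop prints roughly:

```
Apple PROPN
is AUX
looking VERB
at ADP
buying VERB
a DET
U.K. PROPN
startup NOUN
. PUNCT
```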

Naming Conventions

In spaCy, all the pipeline packages follow the naming convention: [lang]_[name]. For spaCy’s pipelines, we divide the name into three components:

  • Type: This signifies capabilities (e.g. core for the general-purpose pipeline with tagging, parsing, lemmatization, and named entity recognition, or dep for only tagging, parsing, and lemmatization).
  • Genre: Type of text the pipeline is trained on, e.g. web or news.
  • Size: Package size indicator, sm, md, lg, or trf.
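For example, en_core_web_sm is an English (en) pipeline of type core, trained on web text, in the small (sm) package size.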

Package Versioning

Pipeline package versioning signifies both compatibility with spaCy and the model version. A package version a.b.c means:

  • spaCy major version: For example, 2 for spaCy v2.x.
  • spaCy minor version: For example, 3 for spaCy v2.3.x.
  • Model version: a different model config, e.g. trained on different data, with different parameters, for different numbers of iterations, or with different vectors.
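For example, a package versioned 2.3.1 targets spaCy v2.3.x, and the trailing 1 distinguishes this model from other models trained for the same spaCy version.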

spaCy’s Processing Pipeline

Let's create a spaCy object.

Now, pass a string of text to the nlp object and receive a Doc object, as in the sketch below.
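A minimal sketch, again assuming en_core_web_sm and an example sentence of our own:

```python
import spacy

# Create the spaCy object (a Language instance holding the pipeline).
nlp = spacy.load("en_core_web_sm")

# Calling the nlp object on a string returns a processed Doc.
doc = nlp("spaCy's statistical models power its processing pipeline.")

print(type(doc))                      # <class 'spacy.tokens.doc.Doc'>
print([token.text for token in doc])  # the tokenized text
```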

What does the nlp Object do?


  • First, the tokenizer is applied to turn the string of text into a Doc object.
  • Next, a series of pipeline components is applied to the Doc in order: in this case, the tagger, then the parser, then the entity recognizer.
  • Finally, the processed Doc is returned, so you can work with it. The component order can be inspected through nlp.pipe_names, as sketched below.
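To see which components run after the tokenizer, we can inspect nlp.pipe_names; the exact list depends on the spaCy version and package:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Names of the pipeline components applied to the Doc, in order.
print(nlp.pipe_names)
# e.g. ['tagger', 'parser', 'ner'] for a v2-style pipeline; newer
# packages also include components such as 'tok2vec' and 'lemmatizer'.
```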

Conclusion

The key takeaways from this article are:

  • spaCy is a free and open-source library for Natural Language Processing in Python with a lot of in-built capabilities.
  • The process of applying statistical analysis to a dataset is called statistical modeling.
  • The statistical models help spaCy to perform several NLP-related tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.