Working with Categorical Data in Pandas

Overview

Categorical data (also called categoricals) is a data type in Pandas that corresponds to the categorical variables used in statistics. A categorical variable takes on a limited set of values that are usually fixed. The categories may have a fixed order, but we cannot perform numerical operations such as addition on categorical data. When a column has only a few distinct values, the categorical data type typically takes less memory than storing the same values as strings. Some examples of categorical variables are observation timings, blood type, country affiliation, and gender.

What is Categorical Data?

Before learning about categorical data and how we can work with categorical data in Pandas, let us get a brief introduction to Pandas.

Pandas is an open-source (free to use) library that is built on top of another very useful Python library, NumPy. It provides us with highly optimized data structures and data analysis tools. The Pandas library is widely used in data science, machine learning, and data analytics because it simplifies data importing and data analysis. The Pandas module helps us to work with large data sets (or data frames) in terms of rows and columns. We often use it to work with tabular files such as CSV files and Excel spreadsheets.

The Pandas package offers a wide variety of data structures and operations that help in easy manipulation (add, update, delete) of numerical data as well as time series. The prime reasons for its popularity are that data is easy to import and easy to analyze. The Pandas module is also quite fast, which makes it handy for productive work.

Now, what is categorical data? Categorical data (or categoricals) is a data type in Pandas that corresponds to the categorical variables used in statistics. Some examples of categorical variables are observation timings, blood type, country affiliation, and gender. The key property of a categorical variable is that it takes on a limited set of values that are usually fixed. The categories may have a fixed order, but we cannot perform numerical operations on categorical data.

When is Categorical Data Useful?

Let us look at some of the scenarios in which categorical data is quite useful.

  • Whenever we have string variables consisting of very few different values, we can convert these types of strings into categorical variables. This conversion of strings into categorical variables saves memory.
  • When the lexical order of a variable is not the same as its logical order (for example, 'one', 'two', 'three'), we can convert the variable into a categorical variable and specify an order on its categories. Sorting and finding the min/max will then use the logical order instead of the lexical order.
  • If we want to tell other Python libraries that the current column should be treated as a categorical variable, then we can use the categorical data transformation. If the data is treated as categorical data, then we can easily plot graphs and use suitable statistical methods on the data set.

Categorical Data Methods

Let us now learn about some of the categorical data methods.

  • series.astype() :
    When called with 'category', this method converts the series data into categorical data.
  • categoricals.cat :
    The cat accessor gives us access to the categorical properties and methods.
  • categoricals.cat.codes :
    The codes attribute is used to view the integer code assigned to each value present in the data.
  • categoricals.cat.categories :
    The categories attribute is used to view the categories of the data.
  • categoricals.cat.set_categories :
    The set_categories method replaces the categories with a specified list, so it can add new categories and drop old ones in one step.
  • categoricals.cat.remove_unused_categories :
    The remove_unused_categories method is used to remove the categories that do not appear in the data.
  • pandas.get_dummies(categoricals) :
    The get_dummies function is used to convert the categorical data into dummy (indicator) variables.

Our dataset can contain several types of data values, and many machine learning algorithms expect numeric input, so we often convert categorical data into dummy variables. A dummy variable is a binary variable that indicates whether a categorical variable takes on a specific value. We can use the get_dummies() function to convert categorical data into dummy variables.
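
The attributes and functions listed above can be tried out on a tiny series. The following is a minimal sketch, not the article's original example; the sample values and variable names are made up for illustration.

```python
import pandas as pd

# A small categorical series with two distinct values (illustrative data).
s = pd.Series(["a", "b", "a", "b"]).astype("category")

print(s.cat.categories)  # the distinct categories
print(s.cat.codes)       # the integer code stored for each value

# get_dummies creates one indicator column per category.
dummies = pd.get_dummies(s)
print(dummies)
```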

To learn more about these functions and other functions, please refer to the later sections. You can refer to the official documentation for more functions and methods.

Categorical Object Creation

Let us now learn how various types of categorical objects are created.

1. Series Creation

A Series can be seen as a single column of a Pandas DataFrame (which can be seen as a table). Let us see how we can create categorical data when creating a Series. If we want the series to hold categorical data, we can specify the dtype (data type) as 'category'. Refer to the example provided below for more clarity.
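
A minimal sketch of creating a categorical series at construction time; the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Passing dtype="category" makes the series categorical from the start.
s = pd.Series(["a", "b", "c", "a"], dtype="category")

print(s.dtype)                 # category
print(list(s.cat.categories))  # ['a', 'b', 'c']
```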

  • We can also convert an already created series into categorical data by passing 'category' to the astype() function.
  • We can also pass the pandas.Categorical object to the Series() function to create a categorical series.
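
Both alternatives can be sketched as follows. The data and the extra category "z" are illustrative assumptions; they show that a pandas.Categorical can declare categories that do not occur in the values.

```python
import pandas as pd

# Convert an existing series with astype().
plain = pd.Series(["x", "y", "x"])
converted = plain.astype("category")

# Build a Categorical first, then wrap it in a Series.
raw = pd.Categorical(["x", "y", "x"], categories=["x", "y", "z"])
from_categorical = pd.Series(raw)

print(converted.dtype)
print(list(from_categorical.cat.categories))  # ['x', 'y', 'z']
```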

2. DataFrame Creation

Similar to the conversion of a single series into categorical data discussed above, we can convert all the columns of a DataFrame into categorical data, either while constructing it or afterwards. Refer to the example provided below for more clarity.

  • We can use the astype() function on data frames as well; we just need to pass 'category' as the argument to the function.
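
A minimal sketch of both routes for a DataFrame, using made-up column names and values. Each column gets its own categories, inferred from that column's data.

```python
import pandas as pd

data = {"A": list("abca"), "B": list("bccd")}

# Every column becomes categorical at construction time.
df_at_creation = pd.DataFrame(data, dtype="category")

# Or convert an existing DataFrame afterwards.
df_converted = pd.DataFrame(data).astype("category")

print(df_converted.dtypes)
print(list(df_converted["B"].cat.categories))  # ['b', 'c', 'd']
```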

3. Controlling Behavior

In the above two examples of series and data frames, we relied on the default behavior by passing 'category' as the data type: the categories are inferred from the data and they are unordered. Let us now learn about other ways of defining categories.

  1. We can pass an instance of CategoricalDtype in place of 'category' for a series. This lets us choose the categories and whether they are ordered. The resulting series has the category dtype with exactly the categories we specified.

  2. We can pass an instance of CategoricalDtype in place of 'category' for data frames as well, so that all columns share the same categories.
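
Both cases can be sketched with one shared dtype. The category names below are illustrative assumptions. Note that a value that is not among the listed categories ("huge" here) becomes NaN after the conversion.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# An explicit, ordered set of categories (hypothetical labels).
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# Series: "huge" is not a listed category, so it turns into NaN.
s = pd.Series(["small", "large", "huge"]).astype(size_type)

# DataFrame: every column is converted to the same dtype.
df = pd.DataFrame({"A": ["small", "large"], "B": ["medium", "small"]}).astype(size_type)

print(s)
print(df.dtypes)
```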

4. Regaining Original Data

We can convert categorical data back into the original data type using Series.astype(original_dtype) or np.asarray(categorical). Refer to the example provided below for more clarity.
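
A minimal sketch of both conversions, assuming the original values were strings; the data is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# Back to plain strings with astype().
restored = s.astype(str)

# Or as a NumPy array of the underlying values.
as_array = np.asarray(s)

print(restored.dtype)
print(as_array)
```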

Converting to Categorical Data

We can convert a variable into a categorical type of variable. Let us take an example for more clarity.
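
One common case is converting a single column of an existing DataFrame. The sketch below assumes a hypothetical table with a "grade" column; it is not the article's original example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "grade": ["B", "A", "B"]})

# Convert only the "grade" column; "name" is left untouched.
df["grade"] = df["grade"].astype("category")

print(df.dtypes)
print(list(df["grade"].cat.categories))  # ['A', 'B']
```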

Working with Categorical Data

Let us now learn how we work with various categorical data and how various operations are performed on the categorical data.

Categorical data has two properties: categories and ordered. We can set the dtype to 'category' to create categorical data, and we can control the ordering through the ordered property. For a series s, the categories property is exposed as s.cat.categories and the ordered property is exposed as s.cat.ordered.

1. Renaming Categories

If we want to rename the categories, we can use the rename_categories() method. Refer to the example provided below for more clarity.
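
A minimal sketch of renaming with a mapping from old to new names; the labels are made up. The method returns a new series and leaves the original unchanged.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# Map each old category name to a new one.
renamed = s.cat.rename_categories({"a": "alpha", "b": "beta"})

print(renamed.tolist())  # ['alpha', 'beta', 'alpha']
```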

2. Appending New Categories

We can use the add_categories() method if we want to add new categories. The new categories are appended to the existing ones, and the values in the data stay the same. Refer to the example provided below for more clarity.
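
A minimal sketch with illustrative data, showing that a category can exist without any value using it yet.

```python
import pandas as pd

s = pd.Series(["a", "b"], dtype="category")

# Append a new category; no value uses it yet.
extended = s.cat.add_categories(["c"])

print(list(extended.cat.categories))  # ['a', 'b', 'c']
```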

3. Removing Categories

If we want to remove any category, we can use the remove_categories() method. Note that values belonging to a removed category are replaced with np.nan. Refer to the example provided below for more clarity.
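
A minimal sketch with illustrative data: after removing category "a", every former "a" value is missing.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# Values that belonged to the removed category become NaN.
reduced = s.cat.remove_categories(["a"])

print(reduced)
```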

4. Removing Unused Categories

Just as we used the remove_categories() method to remove specific categories, we can use the remove_unused_categories() method to remove every category that does not appear in the data. Refer to the example provided below for more clarity.
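
A minimal sketch, assuming a series that declares more categories ("c" and "d") than it actually uses.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b", "c", "d"]))

# "c" and "d" never occur in the data, so they are dropped.
trimmed = s.cat.remove_unused_categories()

print(list(trimmed.cat.categories))  # ['a', 'b']
```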

5. Setting Categories

As we have learned above, we can remove categories as well as add new ones. We can perform both operations in a single step using the set_categories() method, which can also be faster than doing them separately. Values whose category is not in the new list are set to np.nan. Refer to the example provided below for more clarity.
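
A minimal sketch with illustrative data: "a" is dropped and "d" is added in one call.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

# Replace the category list: drops "a", keeps "b" and "c", adds "d".
updated = s.cat.set_categories(["b", "c", "d"])

print(list(updated.cat.categories))  # ['b', 'c', 'd']
print(updated.isna().tolist())       # [True, False, False]
```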

Sorting and Ordering Categorical Data

Let us now learn how we can sort and order categorical data. As we know, the ordered property can be set to True, which means that the categories have an order. On ordered data we can perform order-based operations such as .min() and .max(); if the data is unordered, these operations raise a TypeError.

Note :
We can set the categorical data to be ordered by using the as_ordered() method or unordered by using the as_unordered() method. Both are available through the cat accessor.
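
A minimal sketch of the difference, using made-up labels whose logical order differs from their alphabetical order.

```python
import pandas as pd

# Unordered categorical with the categories listed in logical order.
s = pd.Series(pd.Categorical(["low", "high", "medium"],
                             categories=["low", "medium", "high"]))

# min() is not defined for unordered categoricals.
try:
    s.min()
    raised = False
except TypeError:
    raised = True

# as_ordered() keeps the category list and marks it as ordered.
ordered = s.cat.as_ordered()
print(ordered.min(), ordered.max())  # low high
```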

Re-ordering

We can reorder the categories using the Categorical.reorder_categories() function as well as the Categorical.set_categories() function. If we use Categorical.reorder_categories(), the new list must contain exactly the old categories: every old category must be included and no new categories are allowed. Refer to the example provided below for more clarity.
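
A minimal sketch with illustrative data: after reordering, sorting follows the new category order rather than the alphabetical one.

```python
import pandas as pd

s = pd.Series(["b", "a", "c"], dtype="category")

# Same categories, new order; also mark them as ordered.
reordered = s.cat.reorder_categories(["c", "b", "a"], ordered=True)

print(reordered.sort_values().tolist())  # ['c', 'b', 'a']
```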

Multi-Column Sorting

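Similar to the column sorting and reordering discussed above, we can also sort by multiple columns. A categorical column takes part in a multi-column sort like any other column, and its sort order is determined by the order of its categories. Refer to the example provided below for more clarity.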
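
A minimal sketch with a hypothetical "size" column (categorical) and "price" column (numeric). Rows are sorted by category order first, then by price.

```python
import pandas as pd

df = pd.DataFrame({
    "size": pd.Categorical(["small", "large", "small", "large"],
                           categories=["small", "large"], ordered=True),
    "price": [3, 1, 2, 4],
})

# "small" sorts before "large" because of the category order.
result = df.sort_values(by=["size", "price"])
print(result)
```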

Different Operations in Categorical Data

Let us now learn about the comparison of categorical data and various other related operations.

Comparing the Categorical Data

We can use the == and != operators to compare categorical data with a list-like object (for example, a list, Series, or array) of the same length. We can also use all the comparison operators (==, !=, >, <, >=, and <=) to compare categorical data with another categorical series when ordered is set to True and both have the same categories.

Note:
If the two categoricals being compared have different categories, the comparison raises a TypeError.
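
The rules above can be sketched as follows. The labels and variable names are illustrative assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = CategoricalDtype(["low", "medium", "high"], ordered=True)
a = pd.Series(["low", "high", "medium"]).astype(levels)
b = pd.Series(["medium", "medium", "medium"]).astype(levels)

# Equality against a list-like of the same length.
equal_to_list = a == ["low", "low", "low"]

# Ordered comparison between categoricals with the same categories.
greater = a > b

# Different categories: the comparison is rejected.
other = pd.Series(["low", "high", "low"]).astype(
    CategoricalDtype(["low", "high"], ordered=True))
try:
    a > other
    raised = False
except TypeError:
    raised = True

print(equal_to_list.tolist())  # [True, False, False]
print(greater.tolist())        # [False, True, False]
```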

Other Operations

Apart from Series.min(), Series.max(), and Series.mode(), some of the other important operations that can be performed on the categorical data are :

  • Series.value_counts() :
    This method counts the occurrences of each category and also lists the categories that are not present in the data, with a count of 0.
  • DataFrame.sum() :
    Aggregations such as sum() on data grouped by a categorical column also show the unused categories when observed=False is passed to groupby().
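
A minimal sketch of both behaviours, assuming a series whose category "c" is declared but never used. The grouped sum uses groupby() with observed=False, which is an assumption about how the aggregation is set up.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# The unused category "c" is listed with a count of 0.
counts = s.value_counts()

# Grouped sum: observed=False keeps the unused category as well.
df = pd.DataFrame({"key": s, "value": [1, 2, 3]})
totals = df.groupby("key", observed=False)["value"].sum()

print(counts)
print(totals)
```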

Data Munging

Data munging refers to preparing our data for a dedicated purpose. The Pandas module provides optimized accessors like .loc, .iloc, .at, and .iat. We can use them on categorical data as well, but whether the result keeps the category dtype depends on what we select. Let us learn about various ways of data munging.

Some of the important things related to data munging are as follows:

  • When we perform the slicing operation of the data, the slicing operation returns a DataFrame or a Series, and the type of the data i.e. category is preserved.
  • When we operate on a single row, then the category type is not preserved; hence the return type of the Series becomes object.
  • If we return a single data item from the set of categorical data then the returned value is not categorical data of length 1 but it is a value only.
  • If we want to get a single Series of category type then we can pass the list with only a single value.
  • If we want to work with accessors like .str and .dt (for string and datetime data, respectively), we can use them when s.cat.categories is of the appropriate type.
  • If we are combining Series or DataFrames we can use the astype() method or the union_categoricals() method to ensure that the result is a category.
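
The points above can be sketched on a small DataFrame. The column names and values are hypothetical.

```python
import pandas as pd
from pandas.api.types import union_categoricals

df = pd.DataFrame({
    "cats": pd.Series(["a", "b", "b", "c"], dtype="category"),
    "nums": [1, 2, 2, 3],
})

sliced = df.iloc[1:3]            # slice: the category dtype is preserved
row = df.iloc[0]                 # single row: no longer categorical
scalar = df.iloc[0, 0]           # single item: a plain value
single = df.iloc[[0], 0]         # list with one position: categorical Series
upper = df["cats"].str.upper()   # .str works because the categories are strings

# Combine two categoricals and keep a categorical result.
combined = union_categoricals([pd.Categorical(["b", "c"]),
                               pd.Categorical(["a", "b"])])
print(list(combined.categories))  # ['b', 'c', 'a']
```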

What is the Categorical Index in Pandas?

If our data contains some duplicates and we want to index the values, then we can use the CategoricalIndex. The CategoricalIndex is a type of index that acts as a container around the categorical data and helps us in efficient data storage and indexing when there is a large number of duplicate elements. Refer to the example provided below for more clarity.
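
A minimal sketch of a DataFrame indexed by a CategoricalIndex with a duplicated label; the data is made up for illustration.

```python
import pandas as pd

# The label "a" appears twice in the index.
df = pd.DataFrame({"value": [10, 20, 30, 40]},
                  index=pd.CategoricalIndex(["a", "b", "a", "c"]))

# Selecting a duplicated label returns all matching rows.
rows_a = df.loc["a"]

print(df.index)
print(rows_a)
```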

Removing Missing Categorical Data in Pandas

If we have some missing values in the categorical data, we can perform some operations to deal with the missing data. Let us briefly learn about them.

  1. Delete:
    We can delete the entire column of missing data. We can also delete the rows having null values or missing values.
  2. Replace:
    We can also replace the missing data (which appears as NaN, as we saw earlier when removing categories) with some value, such as the most frequently used value. We can even predict the missing values using a classification algorithm.
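
The delete and replace options can be sketched as follows, using dropna() and fillna() with the most frequent value. This is one possible approach with illustrative data, not the only way to handle missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b", "a"], dtype="category")

# Delete: drop the rows with missing values.
dropped = s.dropna()

# Replace: fill with the most frequent category.
most_frequent = s.mode().iloc[0]
filled = s.fillna(most_frequent)

print(filled.tolist())  # ['a', 'a', 'b', 'a']
```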

Performance of Categorical Types

The categorical version of the data often works faster and takes less space in memory than the same data stored as strings, provided that the number of distinct values is small compared to the number of rows. So if we are working with big data (a large set of data), then converting such columns into categorical data can be beneficial.

For more clarity, let us now look at the memory usage of normal and categorical data.
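
A minimal sketch of such a comparison, assuming a column of 30,000 strings with only three distinct values. The exact byte counts depend on the Pandas version, so none are shown here.

```python
import pandas as pd

# Many rows, few distinct values (illustrative data).
values = pd.Series(["red", "green", "blue"] * 10_000)
as_category = values.astype("category")

# deep=True also counts the memory used by the strings themselves.
plain_bytes = values.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
```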

Conclusion

  • Categorical data (or categoricals) is a data type in Pandas that corresponds to the categorical variables used in statistics. Some examples of categorical variables are observation timings, blood type, country affiliation, and gender.
  • A categorical variable takes on a limited set of values that are usually fixed. When there are few distinct values, the categorical data type takes less memory than storing the same values as strings.
  • Whenever we have string variables consisting of very few different values, we can convert these types of strings into categorical variables. This conversion of strings into categorical variables saves memory.
  • If we want to tell other Python libraries that the current column should be treated as a categorical variable, then we can use the categorical data transformation.
  • A dummy variable is a binary variable that indicates whether a categorical variable takes on a specific value. We can use the get_dummies() function to convert categorical data into dummy variables.
  • The categorical version of the data often works faster and takes less space in memory than the same data stored as strings.