Statistical Functions in Pandas


Overview

Working with large amounts of data can be complex, especially when you need to perform mathematical operations on it. How would you calculate the mean, median, or standard deviation of thousands of values? By hand, or with a calculator? Neither is a good idea. In situations like these, Pandas and its functions come to the rescue. In this article, we will look at how statistical operations can each be performed in a single line using different Pandas functions.

Introduction

Statistics plays a major role wherever data is involved. Here we learn about major statistical operations like mean, median, mode, and variance. Let's dive into how to perform such tedious tasks in a simplified manner with the help of Pandas.

Maximum Element and Minimum Element in Pandas

  • max(): It returns the maximum of the values over the requested axis.

Syntax:

DataFrame.max(axis=_NoDefault.no_default, skipna=True, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns') and sets the axis along which the function is applied. With 0, the result has one value per column; with 1, it has one value per row. For Series, this parameter is unused and defaults to 0.

  • skipna: It takes a bool value; the default is True, which excludes NA/null values when computing the result.

  • level: It takes an integer or a level name (string). The default value is None. If the axis is a MultiIndex (hierarchical), the function computes along the given level, collapsing into a Series.

Note: The level keyword is deprecated. Use groupby instead.

  • numeric_only: It takes a boolean value. When True, only float, int, and boolean columns are included. The signature above lists None as the default (try everything, then fall back to numeric data); that default is deprecated since pandas 1.5, and it is False from pandas 2.0 onwards. It is not implemented for Series.

  • **kwargs: Additional keyword arguments to be passed to the function.

Return type: It returns a Series. If the level is specified, it returns a DataFrame.

First, let's import the dataset we will be working with. You can download it from here.

Code Example 1:

Output:

Code Example 2:

Output:
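As a minimal sketch of max(), assuming a small made-up marks table rather than the article's dataset (the column names and values here are hypothetical), the effect of axis and skipna can be tried like this:

```python
import numpy as np
import pandas as pd

# Hypothetical marks table (an assumption, not the article's dataset).
df = pd.DataFrame(
    {"math": [72, 85, np.nan, 90], "physics": [65, 95, 80, 70]},
    index=["a", "b", "c", "d"],
)

col_max = df.max()                     # axis=0 by default: one maximum per column
row_max = df.max(axis=1)               # one maximum per row
col_max_strict = df.max(skipna=False)  # NaN is not skipped, so "math" becomes NaN

print(col_max)
print(row_max)
print(col_max_strict)
```

With the default skipna=True the missing math mark is ignored; with skipna=False it propagates into the result for that column.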

  • min(): It returns the minimum of the values over the requested axis.

Syntax:

DataFrame.min(axis=_NoDefault.no_default, skipna=True, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns') and sets the axis along which the function is applied. With 0, the result has one value per column; with 1, it has one value per row. For Series, this parameter is unused and defaults to 0.

  • skipna: It takes a bool value; the default is True, which excludes NA/null values when computing the result.

  • level: It takes an integer or a level name (string). The default value is None. If the axis is a MultiIndex (hierarchical), the function computes along the given level, collapsing into a Series.

Note: The level keyword is deprecated. Use groupby instead.

  • numeric_only: It takes a boolean value. When True, only float, int, and boolean columns are included. The signature above lists None as the default; that default is deprecated since pandas 1.5, and it is False from pandas 2.0 onwards. It is not implemented for Series.

  • **kwargs: Additional keyword arguments to be passed to the function.

Return type: It returns a Series. If the level is specified, it returns a DataFrame.

Code Example 3:

Output:
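A short sketch of min(), again on made-up data (the city, temp, and humidity columns are assumptions for illustration). It also shows groupby, which the note above recommends in place of the deprecated level keyword:

```python
import pandas as pd

# Hypothetical weather readings (not the article's dataset).
df = pd.DataFrame(
    {
        "city": ["Pune", "Pune", "Delhi", "Delhi"],
        "temp": [31, 28, 40, 35],
        "humidity": [60, 72, 30, 41],
    }
)

# Minimum of each numeric column over all rows.
col_min = df[["temp", "humidity"]].min()

# Minimum per group: the groupby-based alternative to the level keyword.
city_min = df.groupby("city").min()

print(col_min)
print(city_min)
```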

Mean, Median, Mode, Variance, and Standard Deviation in Pandas

The three M's (mean, median, and mode) are the most common operations that we perform in statistics.

    • mean(): Another common name for the mean is the average. How do we calculate the mean mathematically? For any n elements, the mean is calculated as (sum of all elements) / n. Let us see how we calculate the mean in Pandas.

pandas.DataFrame.mean:- It returns the mean of the values over the requested axis.

Syntax:

DataFrame.mean(axis=_NoDefault.no_default, skipna=True, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns') and sets the axis along which the function is applied. With 0, the result has one value per column; with 1, it has one value per row. For Series, this parameter is unused and defaults to 0.

  • skipna: It takes a bool value; the default is True, which excludes NA/null values when computing the result.

  • level: It takes an integer or a level name (string). The default value is None. If the axis is a MultiIndex (hierarchical), the function computes along the given level, collapsing into a Series.

  • numeric_only: It takes a boolean value. When True, only float, int, and boolean columns are included. The signature above lists None as the default; that default is deprecated since pandas 1.5, and it is False from pandas 2.0 onwards. It is not implemented for Series.

  • **kwargs: Additional keyword arguments to be passed to the function.

Return type: It returns a Series. If the level is specified, it returns a DataFrame.

Code Example 4:

Output:
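To connect the formula with the function, here is a minimal sketch on a hypothetical scores table (names and values are assumptions). It compares mean() with the manual sum divided by n and shows how a missing value is skipped:

```python
import numpy as np
import pandas as pd

# Hypothetical scores (not the article's dataset).
scores = pd.DataFrame(
    {"math": [70, 80, 90], "physics": [60.0, np.nan, 100.0]}
)

col_mean = scores.mean()        # per column; the NaN in "physics" is skipped
row_mean = scores.mean(axis=1)  # per row

# The same value as col_mean["math"], computed by hand: sum / n.
manual_math_mean = scores["math"].sum() / len(scores["math"])

print(col_mean)
print(row_mean)
print(manual_math_mean)
```

Because skipna defaults to True, the physics mean is taken over the two available values only.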

    • median(): It returns the median, that is, the middle value of the sorted data (or the average of the two middle values when the count is even), over the requested axis.

Syntax:

DataFrame.median(axis=_NoDefault.no_default, skipna=True, level=None, numeric_only=None, **kwargs)

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns') and sets the axis along which the function is applied. With 0, the result has one value per column; with 1, it has one value per row. For Series, this parameter is unused and defaults to 0.

  • skipna: It takes a bool value; the default is True, which excludes NA/null values when computing the result.

  • level: It takes an integer or a level name. The default value is None. If the axis is a MultiIndex (hierarchical), the function computes along the given level, collapsing into a Series.

Note: This parameter is deprecated in later versions; use groupby instead.

  • numeric_only: It takes a boolean value. When True, only float, int, and boolean columns are included. If None, the function attempts to use everything and then falls back to numeric data only. It is not implemented for Series.

  • **kwargs: Additional keyword arguments to be passed to the function.

Return type: It returns a Series. If the level is specified, it returns a DataFrame.

Code Example 5:

Output:
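A small sketch of median() on made-up values (all data here is hypothetical). It covers the odd and even cases and shows why the median is often preferred when a column contains an outlier:

```python
import pandas as pd

odd = pd.Series([7, 1, 3])      # sorted: 1, 3, 7 -> the middle value
even = pd.Series([7, 1, 3, 5])  # sorted: 1, 3, 5, 7 -> average of 3 and 5

# Column "a" contains an outlier (100) that pulls the mean but not the median.
df = pd.DataFrame({"a": [1, 2, 100], "b": [10, 20, 30]})
col_median = df.median()
col_mean = df.mean()

print(odd.median(), even.median())
print(col_median)
print(col_mean)
```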

    • mode(): It returns the mode, that is, the value that occurs most often in a given set of values. A set of values can have more than one mode.

Syntax:

DataFrame.mode(axis=0, numeric_only=False, dropna=True)
Series.mode(dropna=True)

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns'). With 0, the mode of each column is returned; with 1, the mode of each row.

  • numeric_only: It takes a boolean value; the default is False. If True, the function is applied only to numeric columns.

  • dropna: It takes a bool value; the default is True. When True, NA/null values are not counted.

Returns: It returns the modes of each column or row as a DataFrame (a Series for Series.mode()). Because there can be several modes, the result can have more than one row.

Code Example 6:

Output:
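The following sketch, on hypothetical values, illustrates that mode() returns a collection rather than a single number, since several values can tie for the highest frequency:

```python
import pandas as pd

# 3 and 5 both occur twice, so this Series has two modes.
s = pd.Series([2, 3, 3, 5, 5, 1])
modes = s.mode()

# mode() also works on non-numeric columns.
df = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "b", "b"]})
df_modes = df.mode()  # one row here, because each column has a single mode

print(modes)
print(df_modes)
```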

  • Variance: It is another important statistical measure, defined as the average squared deviation of the data points from their mean. In other words, it measures how far a set of values is spread out from its mean.

Syntax:

DataFrame.var(axis=None, skipna=True, level=None, ddof=1, numeric_only=None, **kwargs)

It returns the unbiased variance over the requested axis.

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns') and sets the axis along which the function is applied. With 0, the result has one value per column; with 1, it has one value per row. For Series, this parameter is unused and defaults to 0.

  • skipna: It takes a bool value; the default is True, which excludes NA/null values when computing the result.

  • level: It takes an integer or a level name. The default value is None. If the axis is a MultiIndex (hierarchical), the function computes along the given level, collapsing into a Series.

Note: This parameter is deprecated in later versions; use groupby instead.

  • ddof: It takes an integer value; the default is 1. It stands for delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.

  • numeric_only: It takes a boolean value. When True, only float, int, and boolean columns are included. It is not implemented for Series.

Return type: It returns a Series or DataFrame. If the level is specified, it returns a DataFrame.

  • Standard Deviation: The standard deviation is the square root of the variance. It measures how spread out the data is, expressed in the same units as the data. The std() function returns the sample standard deviation over the requested axis.

Syntax:

DataFrame.std(axis=None, skipna=True, level=None, ddof=1, numeric_only=None, **kwargs)

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns') and sets the axis along which the function is applied. With 0, the result has one value per column; with 1, it has one value per row. For Series, this parameter is unused and defaults to 0.

  • skipna: It takes a bool value; the default is True, which excludes NA/null values when computing the result.

  • level: It takes an integer or a level name. The default value is None. If the axis is a MultiIndex (hierarchical), the function computes along the given level, collapsing into a Series.

Note: This parameter is deprecated in later versions; use groupby instead.

  • ddof: It takes an integer value; the default is 1. It stands for delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.

  • numeric_only: It takes a boolean value. When True, only float, int, and boolean columns are included. It is not implemented for Series.

Return type: It returns a Series or DataFrame. If the level is specified, it returns a DataFrame.

Code Example 7:

Output:
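The role of ddof is easiest to see on a small example. This sketch uses eight made-up values whose mean is 5 and whose squared deviations sum to 32, so the divisor N - ddof decides whether the variance is 32/7 (sample, ddof=1) or 32/8 (population, ddof=0):

```python
import pandas as pd

# Hypothetical values: mean = 5, sum of squared deviations = 32.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = s.var()            # ddof=1 by default: 32 / (8 - 1)
population_var = s.var(ddof=0)  # 32 / 8

sample_std = s.std()            # square root of the sample variance
population_std = s.std(ddof=0)  # square root of the population variance

print(sample_var, population_var)
print(sample_std, population_std)
```

The default ddof=1 gives the unbiased sample estimate mentioned above; pass ddof=0 when the data is the whole population.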

Summarizing Data with Pandas Describe Method

What does the describe() method do? The name speaks for itself: it returns a description of the data in the DataFrame. The description consists of summary statistics such as count, mean, standard deviation, min, max, and percentiles.

Syntax:

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

Parameters:

  • percentiles: It takes a list-like of numbers and is optional. It sets the percentiles to include in the output; each value must lie between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

  • include: It takes 'all', a list-like of dtypes, or None (the default). It is optional and acts as a whitelist of data types to include in the result. It is ignored for Series.

    • 'all': All columns of the input are included in the output.
    • list-like of dtypes: It restricts the result to the provided data types. To restrict the result to numeric types, use numpy.number. To restrict it to object columns, use the object dtype (written numpy.object in older documentation). To restrict it to pandas categorical columns, use 'category'.
    • None: It is the default value for this parameter. The result includes all numeric columns.

  • exclude: It takes a list-like of dtypes or None (the default). It is optional and acts as a blacklist of data types to omit from the result. It is ignored for Series.

    • list-like of dtypes: It excludes the provided data types from the result. To exclude numeric types, use numpy.number. To exclude object columns, use the object dtype. To exclude pandas categorical columns, use 'category'.
    • None: It is the default value for this parameter. The result excludes nothing.

  • datetime_is_numeric: It takes a boolean value; the default is False. It decides whether datetime dtypes are treated as numeric, which affects the statistics calculated for the column. For a DataFrame, it also controls whether datetime columns are included by default. This parameter belongs to pandas 1.x; it was removed in pandas 2.0, where datetime data is always summarized as numeric.

Return type: It returns a Series or DataFrame of summary statistics, matching the type of the input.

Code Example 8:

Output:
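A brief sketch of describe() on a hypothetical table with one numeric and one text column (the data is made up). It shows the default behaviour, the include='all' option, and custom percentiles:

```python
import pandas as pd

# Hypothetical data (not the article's dataset).
df = pd.DataFrame(
    {"age": [20, 30, 40, 50], "city": ["Pune", "Delhi", "Pune", "Agra"]}
)

numeric_summary = df.describe()             # default: numeric columns only
full_summary = df.describe(include="all")   # adds unique/top/freq for "city"
custom = df.describe(percentiles=[0.1, 0.9])  # request 10th and 90th percentiles

print(numeric_summary)
print(full_summary)
print(custom)
```

For the text column, the extra rows report the number of distinct values, the most frequent value, and how often it occurs.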

Count Method in Pandas

It counts non-NA cells for each column or row.

Syntax:

DataFrame.count(axis=0, level=None, numeric_only=False)

There are three parameters to be considered while using the function. Let us see how they impact the count function.

Parameters:

  • axis: It takes 0 ('index') or 1 ('columns'). With 0, counts are generated for each column; with 1, counts are generated for each row.

  • level: It takes an integer or a string value and is optional. It is used when the axis is a hierarchical (MultiIndex) one; a string value specifies the level name.

  • numeric_only: It takes a boolean value; the default is False. When True, only float, int, or boolean data is included.

Return type: It returns a Series or DataFrame. For each column or row, it gives the number of non-NA/null entries. If the level is specified, it returns a DataFrame.

Code Example 9:

Output :
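As an illustration, assuming a small made-up table with a few missing entries (the names and values are hypothetical), count() can be applied per column and per row:

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing values (not the article's dataset).
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", None],
        "age": [25, np.nan, 31],
        "city": ["Pune", "Delhi", "Agra"],
    }
)

per_column = df.count()      # non-missing entries in each column
per_row = df.count(axis=1)   # non-missing entries in each row

print(per_column)
print(per_row)
```

Both None and NaN are treated as missing, so they are left out of the counts.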

Covariance in Pandas

It computes the covariance of columns pairwise while excluding the NA/null values.

Syntax:

DataFrame.cov(min_periods=None, ddof=1, numeric_only=_NoDefault.no_default)

Parameters:

  • min_periods: It takes an integer value and is optional. It specifies the minimum number of observations required per pair of columns to have a valid result.

  • ddof: It takes an integer value; the default is 1. It stands for delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.

  • numeric_only: It takes a boolean value. When True, only float, int, or boolean data is included, as with the count function. The signature above (pandas 1.5) leaves the default unset; from pandas 2.0 onwards the default is False.

Return type: It returns the covariance matrix of the columns of the DataFrame, in the form of a DataFrame.

Code Example 10:

Output :
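A minimal sketch of cov() on made-up columns (x, y, and z are hypothetical): y rises with x, z falls as x rises, and the diagonal of the result holds each column's variance:

```python
import pandas as pd

# Hypothetical data: y = 2 * x, z decreases as x increases.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

cov_matrix = df.cov()  # pairwise sample covariance (ddof=1)

print(cov_matrix)
```

A positive entry means the two columns tend to move together, a negative entry means they move in opposite directions.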

Correlation in Pandas

It computes the pairwise correlation of columns, excluding NA/null values.

Syntax:

DataFrame.corr(method='pearson', min_periods=1, numeric_only=_NoDefault.no_default)

Parameters:

  • method: It selects the method of correlation; the default is 'pearson'. It takes one of the following four options:

    • pearson: standard correlation coefficient
    • kendall: Kendall Tau correlation coefficient
    • spearman: Spearman rank correlation
    • callable: a callable that takes two 1d ndarrays as input and returns a float. Note that the matrix returned from corr will have 1 along the diagonal and will be symmetric regardless of the callable's behavior.

  • min_periods: It takes an integer value and is optional. It specifies the minimum number of observations required per pair of columns to obtain a valid result.

Note: Currently it is only available for Pearson and Spearman correlation.

  • numeric_only: It takes a boolean value. When True, only float, int, or boolean data is included, as with the count function. The signature above (pandas 1.5) leaves the default unset; from pandas 2.0 onwards the default is False.

Return type: It returns DataFrame with the computed correlation matrix.

Code Example 11:

Output:
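To see how the method parameter changes the result, here is a sketch on hypothetical columns where y is the square of x (monotonic but not linear) and z is a decreasing linear function of x:

```python
import pandas as pd

# Hypothetical data: y = x ** 2, z = 12 - 2 * x.
df = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25], "z": [10, 8, 6, 4, 2]}
)

pearson = df.corr()                    # linear correlation (default)
spearman = df.corr(method="spearman")  # rank-based correlation

print(pearson)
print(spearman)
```

Pearson gives exactly -1 for x and z (a perfect linear relation) but slightly less than 1 for x and y, while Spearman gives 1 for x and y because the ranks agree perfectly.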

Other Statistical Functions

prod()

This function returns the value of the product for the requested axis. It multiplies all the elements together and returns the product. The index axis is selected by default.

Syntax:

DataFrame.prod(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

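The parameters behave the same as they do for the previous functions. The additional min_count parameter sets the number of valid (non-NA) values required to perform the operation; if fewer are present, the result is NA.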

Code Example 12:

Output:

We look into another example that contains missing values and see how the prod() function behaves.

Code Example 13:

Output:
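The behaviour of prod() with missing values can be sketched as follows, using a made-up two-column table (the data is hypothetical) and the skipna and min_count parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column "b".
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, np.nan, 5.0]})

col_prod = df.prod()               # NaN skipped: a = 1*2*3, b = 4*5
row_prod = df.prod(axis=1)         # product across each row
strict = df.prod(skipna=False)     # NaN propagates into column "b"
need_three = df.prod(min_count=3)  # "b" has only 2 valid values -> NaN

print(col_prod)
print(row_prod)
print(strict)
print(need_three)
```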

cumsum()

This function returns the cumulative sum over any given axis.

Syntax:

DataFrame.cumsum(axis=None, skipna=True, *args, **kwargs)

The parameters behave the same as previous functions.

Code Example 14:

Output:
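A small sketch of cumsum() on hypothetical values, showing the running total along each axis and how a missing value is handled:

```python
import numpy as np
import pandas as pd

# Hypothetical data (not the article's dataset).
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

down_rows = df.cumsum()           # running total down each column
across_cols = df.cumsum(axis=1)   # running total across each row

s = pd.Series([1, 2, np.nan, 4])
running = s.cumsum()              # NaN stays NaN, the total continues after it
strict = s.cumsum(skipna=False)   # everything from the NaN onwards is NaN

print(down_rows)
print(across_cols)
print(running)
print(strict)
```

Unlike sum-style reductions, the result has the same shape as the input: each cell holds the total of all values up to and including it.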

cumprod()

This function returns the cumulative product of all values over the given axis.

Syntax:

DataFrame.cumprod(axis=None, skipna=True, *args, **kwargs)

Code Example 15:

Output:
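For cumprod(), a minimal sketch on made-up integers (hypothetical data) is enough to show the running product; the Series below reproduces the factorials 1!, 2!, 3!, 4!:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
running_product = s.cumprod()  # 1, 1*2, 1*2*3, 1*2*3*4

# Hypothetical table: running product down the rows and across the columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 2, 2]})
down_rows = df.cumprod()
across_cols = df.cumprod(axis=1)

print(running_product)
print(down_rows)
print(across_cols)
```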

Percent Change

This function calculates the fractional change between the current and a prior element (multiply by 100 to get a percentage). By default, it calculates the change from the immediately preceding row.

Syntax:

DataFrame.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)

Code Example 16:

Output:
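As a sketch, assuming a short made-up price series (the values are hypothetical), pct_change() can be compared for the default of one period and for periods=2:

```python
import pandas as pd

# Hypothetical prices (not the article's dataset).
prices = pd.Series([100.0, 110.0, 99.0, 99.0])

change = prices.pct_change()             # change from the previous row
two_back = prices.pct_change(periods=2)  # change from two rows earlier

print(change)
print(two_back)
```

The first row (or the first two rows with periods=2) is NaN because there is no earlier value to compare against. The results are fractions, so 0.10 corresponds to a 10% increase.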

Conclusion

In this article, we looked into major statistical operations and the functions used to perform them. Let's have a quick recap of those functions and what they do:

  • min() and max(): These help us find the minimum and maximum values respectively.
  • mean(): It helps us calculate the average value for a given dataset.
  • median(): It helps us calculate the middle value for a given set of values.
  • mode(): It helps us find the most frequent value for a given set of values.
  • var() and std(): These help us measure the dispersion of data through the variance and standard deviation respectively.
  • describe(): It gives us a detailed statistical description of the data.
  • count(): It helps us count the non-missing values.
  • corr() and cov(): These help us find the correlation and covariance respectively.
  • Among the other statistical functions, prod() and cumprod() give the product and cumulative product over the given axis, cumsum() gives the cumulative sum, and pct_change() gives the change relative to a previous element.

Quite a long article, isn't it? Don't worry. It's okay to feel overwhelmed by so many functions, but the key is practice. You don't need to memorize them all. Just play with the data and use these functions on different kinds of datasets, and you will slowly get the knack of it. Till then, keep experimenting and keep learning.