Statistical Functions in NumPy

Overview

NumPy provides numerous statistical functions for analyzing data. We will talk about the statistical functions offered by NumPy in this section of the article.

These functions compute summary statistics of the data given as input, including the maximum, minimum, mean, median, variance and standard deviation, among others.

Introduction

We live in a world where everything depends on data, so we must understand data to succeed in any field of endeavor.

Statistics is a science that concerns the collection, organization, analysis, interpretation, and presentation of data in an effective manner.

Statisticians provide vital advice for creating reliable analyses and forecasts. They ensure that a study's entire design adheres to the recommended procedures to get reliable results. The techniques they use consist of generating accurate data and performing necessary data analysis.

In computer science, a function is a named, reusable block of code. It takes input values (arguments), operates on them, and returns a result that answers the problem at hand. Because the same function can be called again with different inputs, it lets us reuse logic efficiently.

Now let's combine the above terms to give meaning to the term statistical functions: they are functions used for statistical purposes.

Note: Throughout the code snippets, NumPy is imported under the alias np (import numpy as np) for better readability and neater code.

Maximum element and Minimum element in a NumPy Array

Finding the minimum and maximum array elements along the provided axis is done using the numpy.amin() and numpy.amax() functions, respectively.

Maximum element in a NumPy array

We will make use of the np.amax() function from the NumPy package to do this.

Syntax: np.amax(input_array)

input: NumPy array

output: Singular value depending on data type (maximum value)
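As a minimal sketch of the syntax above, the snippet below finds the maximum of a small array. The arrays scores and grid are hypothetical sample data, not taken from the article; the second call illustrates the optional axis argument.

```python
import numpy as np

# Hypothetical sample data, not taken from the article.
scores = np.array([12, 45, 7, 89, 23])

# Largest element of the whole array.
max_value = np.amax(scores)
print(max_value)  # 89

# With a 2-D array, axis=0 returns the maximum of each column.
grid = np.array([[1, 8],
                 [5, 3]])
column_max = np.amax(grid, axis=0)
print(column_max)  # [5 8]
```

Without the axis argument a single value is returned; with it, the result is an array holding one maximum per column or row.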
Minimum element in a NumPy array

We will make use of the np.amin() function from the NumPy package to do this.

Syntax: np.amin(input_array)

input: NumPy array

output: Singular value depending on data type (minimum value)
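The same idea applies to the minimum. This is a small sketch using hypothetical sample arrays (scores and grid are illustrative names, not from the article).

```python
import numpy as np

# Hypothetical sample data, not taken from the article.
scores = np.array([12, 45, 7, 89, 23])

# Smallest element of the whole array.
min_value = np.amin(scores)
print(min_value)  # 7

# With a 2-D array, axis=1 returns the minimum of each row.
grid = np.array([[1, 8],
                 [5, 3]])
row_min = np.amin(grid, axis=1)
print(row_min)  # [1 3]
```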

Mean, Median, Variance in NumPy

Mean

The mean is also termed the average of the data at hand. It gives us an overall picture of the data.

The mean is one of the measures used to detect skewness in data, that is, how asymmetrically the data is distributed. Comparing it with the median shows whether extreme values are pulling the average to one side.

It is calculated as,

$\mu = {(x_1 + x_2 + x_3 + ... + x_n) \over n}$

Where $\mu$ is the mean, $x_1, ..., x_n$ are the data values and n is the total number of data points.

Let's see how to implement this using NumPy,

Syntax: np.mean(input_array)

input: NumPy array

output: Floating point value (mean)
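Here is a short sketch of np.mean() on a hypothetical array, alongside the same calculation written out by hand from the formula above (sum of the values divided by n).

```python
import numpy as np

# Hypothetical sample data, not taken from the article.
values = np.array([2, 4, 6, 8])

# Mean computed by NumPy.
mean_value = np.mean(values)
print(mean_value)  # 5.0

# The same result from the formula: (x_1 + ... + x_n) / n.
manual_mean = values.sum() / values.size
print(manual_mean)  # 5.0
```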
Median

The median is the most central value in our data when placed in sorted order (ascending or descending).

Median is one of the other factors used to detect skewness in data.

It gives us an insight into the distribution of the data.

The formula for the median differs for odd and even numbers of data points, with the data sorted first.

When n is odd,

$median = \left({n+1 \over 2}\right)^{th}\;term$

When n is even,

$median = {{\left({n\over 2}\right)^{th}\;term + \left({n\over 2}+1\right)^{th}\;term} \over 2}$

Syntax: np.median(input_array)

input: NumPy array

output: Floating point value (median)

When n is odd, np.median() returns the single middle value of the sorted data.

When n is even, the sorted data has two middle values. If those are 69 and 79, for example, the median is their average, so np.median() returns 74.0.

In both cases NumPy gives the output as a single floating-point value.
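Both cases can be checked with a short sketch. The article's original input arrays are not shown, so odd_data and even_data below are hypothetical; even_data is chosen so that its two middle values are 69 and 79.

```python
import numpy as np

# Hypothetical arrays; the input does not need to be sorted beforehand.
odd_data = np.array([79, 55, 98, 69, 60])   # sorted: 55 60 69 79 98
even_data = np.array([98, 69, 55, 79])      # sorted: 55 69 79 98

# n is odd: the middle value of the sorted data.
odd_median = np.median(odd_data)
print(odd_median)  # 69.0

# n is even: the average of the two middle values, (69 + 79) / 2.
even_median = np.median(even_data)
print(even_median)  # 74.0
```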
Variance

Variance is the average of the squared deviations from the mean.

It is a statistical measurement used to determine how far each data value is from the mean and from every other data value in the set.

$\sigma^2 = {{\sum {(x_i - \bar{x})}^2}\over n}$

Where $\sigma^2$ is the variance, $x_i$ is each data value, $\bar{x}$ (x-bar) is the mean of the data set and n is the number of data points.

Let's see how to implement this using NumPy,

Syntax: np.var(input_array)

input: NumPy array

output: Floating point value (variance)
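The sketch below applies np.var() to a hypothetical array and repeats the calculation by hand. By default np.var() divides by n, which matches the formula above; the ddof parameter (set to 1 here only for comparison) switches to the sample variance, which divides by n - 1.

```python
import numpy as np

# Hypothetical sample data, not taken from the article.
values = np.array([2, 4, 6, 8])

# Population variance: average of squared deviations from the mean.
variance = np.var(values)
print(variance)  # 5.0

# The same result written out from the formula.
manual_variance = np.mean((values - values.mean()) ** 2)
print(manual_variance)  # 5.0

# Sample variance divides by n - 1 instead of n.
sample_variance = np.var(values, ddof=1)
```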

Standard Deviation using NumPy

Standard deviation is the square root of the average of square deviations from the mean. It is a statistic that measures the dispersion of data relative to its mean. The formula for standard deviation is,

$\sigma = \sqrt{{\sum {(x_i - \bar{x})}^2}\over n}$

Where $\sigma$ is the standard deviation, $x_i$ is each data value, $\bar{x}$ (x-bar) is the mean of the data set and n is the number of data points.

Let's see how to implement this using NumPy,

Syntax: np.std(input_array)

input: NumPy array

output: Floating point value (standard deviation)
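As a brief sketch on hypothetical data, np.std() can be compared with the square root of np.var() to confirm the relationship described above.

```python
import numpy as np

# Hypothetical sample data, not taken from the article.
values = np.array([2, 4, 6, 8])

# Standard deviation (divides by n by default, like np.var).
std_dev = np.std(values)
print(round(float(std_dev), 3))  # 2.236

# It equals the square root of the variance.
sqrt_of_variance = np.sqrt(np.var(values))
```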


Finding Average Function

The NumPy np.average() function computes the weighted average of an array, and it also works along an axis of a multi-dimensional array. Each element is multiplied by its weight, the products are summed, and the sum is divided by the total of the weights. If weights are not supplied, the ordinary mean is returned instead.

Syntax: np.average(input_array, weights=weights)

input: NumPy array, weights

output: Floating point value (weighted average)
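The following sketch uses the array $[1, 2, 3]$ and the weights $[5, 6, 7]$ discussed in this section. The weights are passed by keyword because the second positional parameter of np.average() is axis, not weights.

```python
import numpy as np

values = np.array([1, 2, 3])
weights = np.array([5, 6, 7])

# Weighted average: (1*5 + 2*6 + 3*7) / (5 + 6 + 7) = 38 / 18.
weighted_avg = np.average(values, weights=weights)
print(weighted_avg)  # 2.111111111111111

# Without weights, np.average() returns the ordinary mean.
plain_avg = np.average(values)
print(plain_avg)  # 2.0
```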


Explanation: Here the elements of the array $[1, 2, 3]$ are multiplied by their respective weights, given as the array $[5, 6, 7]$.

The products sum to $38$ $(5 + 12 + 21)$. This value is then divided by the sum of the weights, i.e. $18$ $(5 + 6 + 7)$, giving the weighted average $38 / 18 = 2.111111111111111$.

Determining Percentile Function using NumPy

In statistics, a q-th percentile (percentile score or centile) is a score below which a given percentage q of scores in its frequency distribution falls. For example, 25th percentile means that 25% of the values fall beneath the 25th percentile value.

This is a highly useful function in determining outliers in data and also to understand the distribution of the data values.

One use case is outlier removal during data preprocessing in machine learning: using the IQR (Interquartile Range), the difference between the 75th and 25th percentiles, you can remove or cap outliers in your data. You can look more deeply into this if you are interested, but it falls beyond the scope of this article.

Let's see how to implement this using NumPy,

Syntax: np.percentile(input_array, percentile_value)

input: NumPy array, percentile value (between 0 and 100)

output: Floating point value (percentile)
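Below is a small sketch of np.percentile(). The article's original input array is not shown, so data is a hypothetical array chosen so that its 25th percentile comes out as 10.5. By default NumPy interpolates linearly between neighbouring values when the percentile falls between two data points.

```python
import numpy as np

# Hypothetical data; sorted it reads 4 10 11 14 18 20 25.
data = np.array([14, 4, 25, 10, 20, 11, 18])

# The 25th percentile lies halfway between 10 and 11 for this array.
p25 = np.percentile(data, 25)
print(p25)  # 10.5
```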


This means that 25% of the values are below 10.5.

NumPy Peak-to-Peak Function

The range of values along an axis can be found with the NumPy np.ptp() function (ptp stands for peak to peak).

It returns a range of values (maximum-minimum) along the specified axis.

Syntax: np.ptp(input_array, axis)

input: NumPy array, axis

output: NumPy array (range of values max-min)
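The sketch below calls np.ptp() along both axes. The original input is not shown in this section, so the 2 x 4 array x is reconstructed from the max-min values quoted in the explanation and should be read as an assumption.

```python
import numpy as np

# Array inferred from the values quoted in the explanation (an assumption).
x = np.array([[4, 9, 2, 10],
              [6, 9, 7, 12]])

# axis=0: max - min within each column, one result per column.
column_range = np.ptp(x, axis=0)
print(column_range)  # [2 0 5 2]

# axis=1: max - min within each row, one result per row.
row_range = np.ptp(x, axis=1)
print(row_range)  # [8 6]
```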


Explanation: $axis=0$ specifies a column-wise operation (one result per column) and $axis=1$ specifies a row-wise operation (one result per row).

The array has 2 rows and 4 columns. Looking at it column-wise, there are 4 columns with 2 elements each, so taking max-min within each column gives an array with 4 elements. Looking at it row-wise, there are 2 rows with 4 elements each, so taking max-min within each row gives an array with 2 elements.

For column-wise ($axis=0$), $[6-4 = 2, 9-9 = 0, 7-2 = 5, 12-10 = 2]$. Therefore, the returned array is $[2, 0, 5, 2]$.

For row-wise ($axis=1$), $[10-2 = 8, 12-6 = 6]$. Therefore, the returned array is $[8, 6]$.

Conclusion

Let's discuss what we have learned here.

Statistics is a science that concerns the collection, organization, analysis, interpretation, and presentation of data in an effective manner.
NumPy implements common statistical functions, so they can be applied to arrays with a single call.
We defined and illustrated various statistical functions from the NumPy package, such as mean, median, variance, standard deviation, weighted average, percentile and peak-to-peak, through code snippets.