Resampling, Rolling Calculations, and Differencing in Pandas


Overview

Pandas offers many tools for processing data and performing calculations on it. When the data contains extreme points, it usually needs to be processed before analysis so that those points do not distort the results. Pandas provides functions for this, including finding patterns in the data, resampling it, and processing it sequentially. Resampling in Pandas is an important tool for extracting statistical information from time series data.

Introduction

When the values in a sample are widely scattered rather than concentrated around the mean, statistics computed from the raw observations can be noisy, and resampling the data can give a clearer picture. This is different from an imbalanced class distribution, which describes a classification dataset in which some classes have far fewer examples than others and which many machine learning algorithms do not account for on their own. When working with data, we also often need to take differences along an axis or compute statistics over a particular window. Pandas provides functions for all of these needs.

Resampling

What is Resampling?

In statistics, resampling is a family of techniques for learning more about a sample, for example by repeatedly drawing new samples from it or by estimating the accuracy of a statistic. These techniques help quantify the uncertainty of an estimate about a population. In Pandas, resampling has a narrower meaning: changing the frequency of time series data and aggregating or filling the values in each new interval.

The resample() Method

The resample() method of a Pandas DataFrame resamples time series data.

It is a convenient method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.

Methods of Resampling

There are two common methods of resampling in statistics. They are distinct from the resample() method, which only changes the frequency of a time series.

1. Cross-Validation Statisticians frequently use cross-validation to evaluate predictive models. In this method, you set aside several data points from the sample to serve as the validation set, and the remaining observations form the training set. The model is fitted on the training set and then used to predict the validation set. Repeating this with different splits and averaging the prediction accuracy shows how accurate each predictive model is.

2. Bootstrapping With the bootstrap method, you repeatedly draw new samples, with replacement, from your observed data. For instance, if you measured 10 people chosen out of 100 to test a hypothesis, you can resample those 10 observations many times, each time drawing 10 values with replacement. Comparing measurements like the mean or median across these resamples shows how much the estimate varies, which helps quantify statistical error. This approach is closely related to the plug-in principle in statistics.
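
The two methods above can be sketched with the Python standard library alone. This is an illustrative sketch, not part of Pandas: the sample values are randomly generated, the "model" simply predicts the mean of the training set, and the helper names (k_fold_indices, cross_validated_error, bootstrap_means) are hypothetical.

```python
import random
import statistics

# Hypothetical sample of 20 measurements, generated only for illustration.
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(20)]


def k_fold_indices(n_items, k):
    """Shuffle the positions 0..n_items-1 and split them into k folds."""
    positions = list(range(n_items))
    random.shuffle(positions)
    return [positions[i::k] for i in range(k)]


def cross_validated_error(data, k=5):
    """Mean absolute error of a 'predict the training mean' model."""
    fold_errors = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        training = [x for i, x in enumerate(data) if i not in held_out]
        validation = [data[i] for i in fold]
        prediction = statistics.mean(training)  # "fit" on the training set
        errors = [abs(x - prediction) for x in validation]
        fold_errors.append(statistics.mean(errors))
    return statistics.mean(fold_errors)


def bootstrap_means(data, n_resamples=1000):
    """Means of resamples drawn with replacement, each as large as data."""
    return [
        statistics.mean(random.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]


cv_error = cross_validated_error(sample)
boot = bootstrap_means(sample)
print("cross-validated mean absolute error:", round(cv_error, 2))
print("bootstrap standard error of the mean:", round(statistics.stdev(boot), 2))
```

The spread of the bootstrap means estimates how much the sample mean would vary from sample to sample, while the cross-validated error estimates how well the model predicts data it was not fitted on.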

Example

For this example, we are going to use a publicly available dataset from Kaggle. It contains the opening and closing prices of HDFC stock.

Code
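
A minimal sketch of the loading step, under stated assumptions: the column names Date, Open, and Close are assumed, and a few made-up rows held in memory stand in for the Kaggle CSV so that the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV: made-up rows with assumed column names.
csv_text = """Date,Open,High,Low,Close,Volume
2021-01-01,100.0,105.0,99.0,104.0,1000
2021-01-04,104.0,108.0,103.0,107.0,1200
2021-01-05,107.0,109.0,101.0,102.0,900
"""

# With the real file, pass its path instead of io.StringIO(csv_text).
df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],  # parse the Date column as datetime values
    index_col="Date",      # use it as the index, which resample() relies on
)

# Keep only the opening and closing prices.
df = df[["Open", "Close"]]
print(df.head())
```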

Here, we read the dataset and pass the parse_dates parameter so that the Date column is parsed as datetime values. We also set the datetime column as the index of our DataFrame, which is essential for resampling the data. The dataset is large, so we narrow it down by keeping only the opening and closing price of the stock. This is what the data looks like now.

Output

Suppose we need the mean opening and closing price for every 4-year period. Instead of running a loop and finding the means one by one, we can use the resample() method.

Code
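
A minimal sketch of the 4-year resampling, assuming a DataFrame with a DatetimeIndex and Open and Close columns. The prices are made up so that the snippet is self-contained. The rule is given as an offset object; depending on the Pandas version, the equivalent string alias is written "4Y" or "4YE".

```python
import pandas as pd

# Made-up daily prices covering 12 years, standing in for the real data.
dates = pd.date_range("2000-01-01", "2011-12-31", freq="D")
df = pd.DataFrame(
    {
        "Open": [100 + 0.10 * i for i in range(len(dates))],
        "Close": [101 + 0.10 * i for i in range(len(dates))],
    },
    index=dates,
)

# Group the rows into 4-year bins anchored at year ends and average each bin.
four_year_mean = df.resample(pd.offsets.YearEnd(4)).mean()
print(four_year_mean)
```

Other aggregations such as sum(), max(), or min() can be applied to the resampled groups in the same way as mean().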

Here, we resample the data into 4-year bins and take the mean of each bin. The output looks like this.

Output

Rolling

The Pandas rolling() function performs calculations over a rolling window: we take a window of k data points, perform some operation on it, then slide the window forward by one data point and repeat the process over the whole dataset.

The .rolling() Method

The most commonly used parameters of rolling() are:

  1. window - Size of the rolling window
  2. min_periods - Minimum number of data points required in each window; otherwise, the result is NA
  3. axis - The axis of the DataFrame along which the window is applied (deprecated since Pandas 2.1)

Example

In this example, we reuse the previous dataset and compute a statistic over every 3-day window. Since each data point corresponds to 1 day, the window holds 3 data points.

Code
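
A minimal sketch of a 3-day rolling calculation. The choice of mean() as the operation is an assumption, and the five rows of prices are made up so that the result is easy to check by hand.

```python
import pandas as pd

# Made-up prices, one row per day.
dates = pd.date_range("2021-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {
        "Open": [10.0, 20.0, 30.0, 40.0, 50.0],
        "Close": [12.0, 22.0, 32.0, 42.0, 52.0],
    },
    index=dates,
)

# Mean over each window of 3 consecutive rows (the current row and the 2 before it).
rolling_mean = df.rolling(window=3).mean()
print(rolling_mean)
# Open column: NaN, NaN, 20.0, 30.0, 40.0

# min_periods=1 lets incomplete windows at the start produce a value.
partial = df.rolling(window=3, min_periods=1).mean()
print(partial)
# Open column: 10.0, 15.0, 20.0, 30.0, 40.0
```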

Output

Here, the first 2 values are NaN because each result is computed from the current data point and the n - 1 points before it (n = 3 in this case), and the first two rows do not have enough preceding data points to fill the window.

What is Differencing?

Introduction

The diff() method finds the difference between each value in the DataFrame and the value a given number of periods earlier. It is a useful tool for time series analysis, for example to measure the change from one period to the next.

Syntax

DataFrame.diff(periods=1, axis=0)

  1. periods: Number of periods to shift when forming the difference (default 1)
  2. axis: Take the difference over rows (0) or columns (1).

Examples

The previous dataset contains the opening and closing prices of a stock. To find out how these prices change compared to the previous day, we can use the diff() method with periods set to 1, which compares each value with the previous data point. In this case, that is the previous day.

Code
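
A minimal sketch of day-over-day differencing, assuming a DataFrame with Open and Close columns and one row per day. The prices are made up for illustration.

```python
import pandas as pd

# Made-up prices, one row per day.
dates = pd.date_range("2021-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 105.0],
        "Close": [101.0, 103.0, 104.0, 104.0],
    },
    index=dates,
)

# Each value minus the value one row earlier, column by column.
daily_change = df.diff(periods=1)
print(daily_change)
# Open column:  NaN, 2.0, -1.0, 4.0
# Close column: NaN, 2.0, 1.0, 0.0
```

The same result can be obtained with df - df.shift(1), which makes explicit that diff() subtracts a shifted copy of the data.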

Output

In this output, we can observe how the data changes from one day to the next. The first value is NaN because data for the previous day is not available. We can change the period in the same way: to compare each value with the one seven rows earlier, which corresponds to the previous week when there is one row per calendar day, we set the periods parameter to 7. We can do a similar analysis across columns by changing the axis to 1.

Code
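
A minimal sketch of differencing across columns, assuming the columns are ordered Open and then Close. The prices are made up for illustration.

```python
import pandas as pd

# Made-up prices, one row per day, with Open to the left of Close.
dates = pd.date_range("2021-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 105.0],
        "Close": [101.0, 103.0, 104.0, 104.0],
    },
    index=dates,
)

# Each column minus the column to its left, row by row.
across_columns = df.diff(axis=1)
print(across_columns)
# Open column:  NaN in every row (there is no column to its left)
# Close column: 1.0, 1.0, 3.0, -1.0  (Close minus Open)
```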

Output

Here, the first column is NaN because there is no column to its left. The second column holds the closing price minus the opening price, which shows how the stock moved within each day.

Conclusion

  • Pandas supports complex statistical analysis of data.
  • This is especially useful for time series data.
  • The resample() method changes the frequency of time series data and aggregates each new interval.
  • Statistical resampling techniques such as cross-validation and bootstrapping help estimate accuracy and uncertainty.
  • For more statistical information, we can use the rolling() and diff() methods of DataFrame.
  • The rolling() method takes a sliding window of k data points and performs operations on it.