What is the Pandas DataFrame sample() Method?
The pandas sample() method is used to get a subset of random rows or columns from a pandas dataframe. If applied to a pandas Series, this method returns a subset of random items from that Series.
The most common usage of the pandas sample() method is to get a sample of random rows from a pandas dataframe. By default, it returns a single random row. In practice, however, we usually provide a specific number or fraction of rows to return.
Syntax
For a pandas dataframe the syntax of the sample() method is as follows:
For a pandas Series, the syntax is almost identical. The axis parameter is still accepted but has no effect, since a Series has only one axis:
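As a rough illustration of the syntax, the sketch below lists the parameter names via introspection and then calls the method with every parameter spelled out at its default value. It assumes pandas 1.3.0 or later (where ignore_index exists); the tiny dataframe is made up for the purpose.

```python
import inspect
import pandas as pd

# Parameter names of DataFrame.sample(), skipping `self`.
df_params = [name for name in inspect.signature(pd.DataFrame.sample).parameters
             if name != "self"]
print(df_params)

# A made-up dataframe and Series, just to have something to sample from.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
s = df["a"]

# Every parameter written out with its default value.
row = df.sample(n=None, frac=None, replace=False, weights=None,
                random_state=None, axis=None, ignore_index=False)

# The same call on a Series, leaving out axis.
item = s.sample(n=None, frac=None, replace=False, weights=None,
                random_state=None, ignore_index=False)
```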
Parameters
- n – an integer value representing the number of random axis items (rows or columns) to return. By default, the method returns one item. This parameter can't be used in combination with the frac parameter.
- frac – a float value representing the fraction of random axis items (rows or columns) to return. By default, it is not set (None), so the default of n applies and a single item is returned. This parameter can't be used in combination with the n parameter.
- replace – a Boolean value indicating if sampling is performed with or without (by default) replacement. If set to True, the pandas sample() method can return the same item more than once.
- weights – determines the influence of specified axis items (rows or columns) on the result of sampling. By default, all axis items have equal weights. The weights parameter can be set to a string, a list, or a Series. Except for the last case, where the provided Series is aligned with the original object based on the index, weights must be of the same length as the axis being sampled. When sampling random rows from a pandas dataframe, this parameter can take in a column name. In this case, missing values in that column will be treated as zero.
- random_state – allows the reproducibility of results of applying the pandas sample() method. This parameter, when set to an integer, specifies the seed for the random number generator. Alternatively, it can be set to a NumPy random generator object itself. With the same random_state argument passed in, the method returns the same axis items each time.
- axis – takes in the axis index as an integer (0 or 1) or its name as a string ('index' or 'columns'). For a pandas dataframe, to sample random rows, we need to set this parameter to 0 or 'index', while to sample random columns, we set it to 1 or 'columns'. By default, the method samples random rows. When we apply the pandas sample() method on a Series, this parameter isn't used since it doesn't make sense for a one-axis object.
- ignore_index – can be either True (the initial index is ignored and replaced by a 0-based index) or, by default, False (the initial index is preserved). This parameter appeared in pandas version 1.3.0.
Note that all the parameters of the pandas sample() method are optional.
Returns
A new pandas dataframe or Series (depending on the original object) with a subset of rows or columns from the original object.
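A quick way to see the return types is to sample from a dataframe and from one of its columns. This is a minimal sketch on a made-up dataframe; the column names mirror ones mentioned later in the article, but the values are invented.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"Rating": [5, 1, 4, 2], "#ThumbsUp": [0, 0, 0, 0]})

rows = df.sample(n=2)              # dataframe in, dataframe out
items = df["Rating"].sample(n=2)   # Series in, Series out

print(type(rows).__name__, type(items).__name__)
```

In both cases the result is a new object, so the original dataframe is left unchanged.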
Examples
Let's look at various scenarios of using the pandas sample() method. To start with, we'll import the pandas library and read in a Kaggle dataset to experiment with: Dating Apps Reviews 2017-2022 (all regions).
Code:
Output:
To make our work easier, let's take only ten random rows from the dataframe:
Code:
Output:
Yes, we've already started using the pandas sample() method in the above piece of code 🙂
Note that the code output is adjusted as per page size and all the columns are in the same alignment.
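For readers without the Kaggle file at hand, the same step can be mimicked with a stand-in dataframe. Everything in the sketch below is an assumption: the Review column, all of the values, and the seed are made up, and only the Rating and #ThumbsUp column names come from the article.

```python
import pandas as pd

# Hypothetical stand-in for the reviews table; all values are invented.
full_df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(50)],
    "Rating": [i % 5 + 1 for i in range(50)],
    "#ThumbsUp": [0] * 50,
})

# Keep ten random rows; the seed (an arbitrary choice) keeps the selection stable.
df = full_df.sample(n=10, random_state=1)
print(df)
```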
Random Row from a Dataframe
To get a single random row from our dataframe, we apply the pandas sample() method on a dataframe without passing in any arguments (i.e., using the default values for all the parameters).
Code:
Output:
If we re-run the above piece of code, we'll most probably get another single random row since we didn't pass in random_state.
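On a small made-up dataframe (a stand-in, not the article's dataset), the default call might look like this:

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# No arguments: one random row, which may differ from run to run.
row = df.sample()
print(row)
```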
Generating a 50% Sample of a Dataframe
To extract a random 50% sample of a pandas dataframe (that is, half of its rows), we need to run the following code:
Code:
Output:
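A minimal sketch of a 50% row sample, assuming a made-up ten-row dataframe in place of the real data:

```python
import pandas as pd

# Made-up stand-in data with ten rows.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# frac=0.5 asks for half of the rows, here five of ten.
half = df.sample(frac=0.5)
print(half)
```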
If we want to add reproducibility to the above piece of code (i.e., if we want to ensure we'll always get the same rows running that code), we need to set the random_state parameter:
Code:
Output:
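Reproducibility can be checked by sampling twice with the same seed. The sketch uses made-up data, and the seed value 1 is an arbitrary choice.

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# The same seed yields the same rows on every call.
first = df.sample(frac=0.5, random_state=1)
second = df.sample(frac=0.5, random_state=1)
print(first.equals(second))
```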
In some cases, we may want to get a random 50% sample of columns of a dataframe:
Code:
Output:
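Sampling columns works the same way once axis=1 is added. The sketch assumes a made-up four-column dataframe, so that half of the columns is exactly two; the App and Review columns are hypothetical.

```python
import pandas as pd

# Made-up stand-in data with four columns.
df = pd.DataFrame({
    "App": ["A", "B", "C", "A"],
    "Review": ["r0", "r1", "r2", "r3"],
    "Rating": [5, 1, 4, 2],
    "#ThumbsUp": [0, 0, 0, 0],
})

# axis=1 switches from sampling rows to sampling columns.
half_cols = df.sample(frac=0.5, axis=1)
print(half_cols.columns.tolist())
```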
Restoring Dataframe Index
In the majority of the above examples, the index of the resulting dataframe isn't sequential anymore. Let's take a piece of code from the previous section and restore a sequential 0-based index for it. To do so, we need to pass in ignore_index=True:
Code:
Output:
Note that the index is now fixed, and the method has returned the same rows as in the previous piece of code.
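The effect of ignore_index can be sketched on made-up data by comparing the index with and without it. This assumes pandas 1.3.0 or later, and the seed is arbitrary.

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

shuffled = df.sample(frac=0.5, random_state=1)
renumbered = df.sample(frac=0.5, random_state=1, ignore_index=True)

print(shuffled.index.tolist())    # original labels, in sampled order
print(renumbered.index.tolist())  # fresh 0-based labels
```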
An alternative to the above piece of code is chaining the pandas sample() and reset_index() methods:
Code:
Output:
Note that above, we passed in the drop=True parameter in the reset_index() method to get rid of the old index and keep only the 0-based sequential one.
Let’s confirm that these two approaches give identical results using the pandas equals() method:
Code:
Output:
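Both approaches and the equality check can be sketched together. The data and the seed are made up; the same seed is used in both calls so that the same rows are drawn.

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# Approach 1: let sample() renumber the rows.
via_ignore = df.sample(frac=0.5, random_state=1, ignore_index=True)

# Approach 2: sample first, then drop the old index.
via_reset = df.sample(frac=0.5, random_state=1).reset_index(drop=True)

identical = via_ignore.equals(via_reset)
print(identical)
```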
Simple Sample Setting n
To fetch the exact number of random rows from our dataframe, we set this number to the n parameter:
Code:
Output:
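For instance, three random rows from a made-up dataframe could be drawn like this (the number 3 is just an example):

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# n is the exact number of rows to return.
three_rows = df.sample(n=3)
print(three_rows)
```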
To fetch an exact number of random columns instead, we also add axis=1 (or axis='columns'):
Code:
Output:
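A sketch of the column variant, again on made-up data with hypothetical App and Review columns, showing that the integer and string forms of axis are interchangeable:

```python
import pandas as pd

# Made-up stand-in data with four columns.
df = pd.DataFrame({
    "App": ["A", "B", "C", "A"],
    "Review": ["r0", "r1", "r2", "r3"],
    "Rating": [5, 1, 4, 2],
    "#ThumbsUp": [0, 0, 0, 0],
})

by_number = df.sample(n=2, axis=1)        # axis given as an integer
by_name = df.sample(n=2, axis="columns")  # axis given by name

print(by_number.columns.tolist())
print(by_name.columns.tolist())
```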
Simple Sample Setting frac
Earlier, we've already tried extracting a 50% sample of our dataframe. Instead of 50%, we can use any other proportion by passing it as the frac parameter:
Code:
Output:
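As an example of another proportion, a 30% sample of a made-up ten-row dataframe returns three rows (the value 0.3 is chosen only for illustration):

```python
import pandas as pd

# Made-up stand-in data with ten rows.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# 30% of ten rows.
subset = df.sample(frac=0.3)
print(subset)
```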
In the same way, as we did earlier, we can make the results reproducible with the random_state parameter, extract columns instead of rows with axis=1, and restore a sequential 0-based index with ignore_index=True.
Note that the n and frac parameters can't be passed in together to the pandas sample() method.
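What happens when both are passed can be checked directly: pandas raises a ValueError. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({"Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2]})

error = None
try:
    df.sample(n=2, frac=0.5)   # n and frac together are rejected
except ValueError as err:
    error = str(err)

print(error)
```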
Sample Setting n and Replace
In some cases, we may want to extract the exact number of random rows from a dataframe and, at the same time, allow returning the same row more than once. In this case, we should use the n and replace parameters together:
Code:
Output:
Note that the row with index four has been returned twice.
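The same idea on made-up data: with replace=True a row may be drawn more than once, though which rows repeat (if any) depends on the random draw. The values of n and the seed are arbitrary.

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# Eight draws with replacement from ten rows.
picked = df.sample(n=8, replace=True, random_state=1)

# Index labels that were drawn more than once (possibly none).
repeats = picked.index[picked.index.duplicated()].unique().tolist()
print(repeats)
```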
If we want to set the n parameter to an integer that is greater than the number of rows in our dataframe, we must pass in replace=True since, in this case, at least one row has to be inevitably repeated:
Code:
Output:
In the above output, we see that the row with index four has been returned twice, the row with index 1 – three times, and the row with index 8 – four times.
We may need to use this approach to upsample (i.e., artificially augment) the available data.
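Upsampling can be sketched on made-up data as follows. Asking for 15 rows out of 10 fails without replacement and succeeds with it; the numbers and the seed are arbitrary.

```python
import pandas as pd

# Made-up stand-in data with ten rows.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

error = None
try:
    df.sample(n=15)            # more rows than exist, no replacement
except ValueError as err:
    error = str(err)
print(error)

# With replacement, the request is valid and some rows must repeat.
upsampled = df.sample(n=15, replace=True, random_state=1)
print(upsampled.index.value_counts())
```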
Analogously, if we provide a fraction of rows to be returned rather than their exact number (the frac parameter instead of n) and this fraction is greater than 1, we must pass in replace=True:
Code:
Output:
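The fractional variant might look like the sketch below, where frac=1.5 on a made-up ten-row dataframe yields 15 rows. The fraction and the seed are arbitrary choices.

```python
import pandas as pd

# Made-up stand-in data with ten rows.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

error = None
try:
    df.sample(frac=1.5)        # a fraction above 1 needs replace=True
except ValueError as err:
    error = str(err)

bigger = df.sample(frac=1.5, replace=True, random_state=1)
print(len(bigger))
```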
Sample with weights
Finally, let's use the weights parameter of the pandas sample() method to make some rows more likely to be picked than others, based on the values in a specific column. Before doing so, let's take a look one more time at our dataframe:
Code:
Output:
In our case, we can use only the Rating column since it contains integer numbers. The #ThumbsUp column is also numeric, but it has the same value in each row, so it wouldn't help as a weight.
Code:
Output:
From the above output, we can see a preference in sampling was given to those rows that have high values in the Rating column.
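The preference for high ratings becomes easier to see with a large sample. The sketch below uses made-up ratings and draws 1,000 rows with replacement, which is an assumption made purely so that the averages can be compared; the seed is arbitrary.

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [5, 1, 4, 2, 5, 3, 1, 4, 5, 2],
})

# Rows are drawn with probability proportional to their Rating value.
weighted = df.sample(n=1000, replace=True, weights="Rating", random_state=1)

# The weighted sample should lean towards higher ratings.
print(df["Rating"].mean(), weighted["Rating"].mean())
```

Passing the column name is a shortcut available when sampling rows; the weights are rescaled internally so that they sum to 1.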
Conclusion
- The pandas sample() method is used to get a subset of random rows or columns from a dataframe or random items from a Series.
- The method returns a new dataframe or Series depending on the original object.
- We can do many things with the pandas sample() method. Here are some examples:
  - extract an exact number or a specific fraction of rows or columns.
  - ensure reproducibility of results.
  - reset the index.
  - allow returning the same row or column more than once.
  - conduct upsampling.
  - give more importance to specified axis items when sampling.