Pandas Read Text File | Working with Text Data


Overview

Pandas is widely used for working with CSV files, thanks to its wide range of operations and easy-to-use functions. The same functionality is available for plain text files: we can create DataFrames from them just as we do from CSV files. This article covers how to handle text files in pandas.

Introduction to Working with Text Data

The pandas Python package is a standard choice for data transformation, data analysis, and data manipulation. Most of the time, the data resides in external sources such as text files. In such cases, we need to load the text data from these files and convert it into DataFrames. Pandas offers a variety of methods for loading data from such sources.

How to Read Text Files with Pandas ?

There are multiple ways to read a text file using pandas. We will look at these methods one by one. We first create a text file of our own by adding some data to it and saving it with a .txt extension.

Text.txt :

Method - 1 : Using read_csv()

What exactly does CSV stand for? CSV stands for comma-separated values: a CSV file is a text file that uses commas as the delimiter between the values of each record. Because read_csv() also accepts other delimiters, we can use the pandas.read_csv() method to load data from a text file even if the file does not have a .csv extension.

The general syntax of read_csv() has many parameters, but we need only a few of them here. To read our text file into a pandas DataFrame, we provide the filename, the separator/delimiter (whitespace in this case), and the row containing the column names, which is usually the first row.

Code Example - 1 :
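
A minimal sketch of this call, assuming a small whitespace-separated file. The sample rows are hypothetical, and an in-memory io.StringIO buffer stands in for Text.txt, whose contents are not shown here.

```python
import io
import pandas as pd

# Hypothetical stand-in for Text.txt (the real file contents are not shown).
text = """Name Age City
Asha 28 Pune
Ravi 35 Delhi
Meera 41 Kochi
"""

# sep=r"\s+" splits on any run of whitespace;
# header=0 takes the column names from the first row.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=0)
print(df)
```

With a file on disk, the path (for example "Text.txt") would be passed in place of the StringIO buffer.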

Output :


Code Example - 2 :

In the previous example, the first row of the file supplied the column names. In this example, we set the header parameter to None. Pandas then treats the first line of the text file as a data entry rather than as labels, and generates default column labels: integers starting from 0.
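
A minimal sketch with header=None, again using hypothetical sample rows and an in-memory buffer in place of Text.txt:

```python
import io
import pandas as pd

# Hypothetical stand-in for Text.txt.
text = """Name Age City
Asha 28 Pune
Ravi 35 Delhi
Meera 41 Kochi
"""

# header=None: no row is used for labels, so the columns are numbered 0, 1, 2
# and the line "Name Age City" becomes the first data row.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)
print(df)
```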

Output :


Code Example - 3 :

In the previous example, the column labels were integers starting from 0. If we want named columns instead, we can pass the labels to the names parameter and set the header parameter to None. This example shows how to assign column names while reading the file.
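
A minimal sketch of the names parameter. It assumes a hypothetical file that has no header row of its own, so every line is data:

```python
import io
import pandas as pd

# Hypothetical file contents without a header row.
text = """Asha 28 Pune
Ravi 35 Delhi
Meera 41 Kochi
"""

# names supplies the column labels; header=None tells pandas
# that the file itself contains no label row.
df = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    header=None,
    names=["Name", "Age", "City"],
)
print(df)
```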

Output :

Method - 2 : Using read_table()

Another way to read data from a text file is by using read_table() in pandas. This function reads a general delimited file into a DataFrame object. It is very similar to read_csv(); the major difference is that the default delimiter in read_table() is '\t' (tab), whereas in read_csv() it is a comma. Here we read the data with read_table(), setting the separator to a single space (' ').

Code Example :
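
A minimal sketch of read_table() with a single-space separator. The sample rows are hypothetical, and the in-memory buffer stands in for Text.txt:

```python
import io
import pandas as pd

# Hypothetical stand-in for Text.txt; fields are separated by exactly one space.
text = """Name Age City
Asha 28 Pune
Ravi 35 Delhi
Meera 41 Kochi
"""

# Override the default tab separator with a single space.
df = pd.read_table(io.StringIO(text), sep=" ")
print(df)
```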

Output :

Method - 3 : Using read_fwf()

What does fwf in the read_fwf() function stand for? It stands for fixed-width formatted lines. This function reads a table of fixed-width formatted lines into a DataFrame. It also supports optionally iterating over the file or breaking it into chunks. Since the columns in our text file are aligned at fixed widths, read_fwf() reads the contents into separate columns.

Code Example :
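
A minimal sketch of read_fwf(), assuming a hypothetical file whose columns are padded with spaces to fixed positions. With the default settings, pandas infers the column boundaries from the whitespace gaps:

```python
import io
import pandas as pd

# Hypothetical fixed-width data: each column starts at the same position on every line.
text = (
    "Name   Age  City\n"
    "Asha   28   Pune\n"
    "Ravi   35   Delhi\n"
    "Meera  41   Kochi\n"
)

# Column boundaries are inferred from the aligned whitespace.
df = pd.read_fwf(io.StringIO(text))
print(df)
```

When the boundaries cannot be inferred reliably, they can be given explicitly through the colspecs or widths parameters.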

Output :

String Methods in Pandas

In this section, we will look into various string methods and their functions.

FUNCTION            DESCRIPTION
str.lower()         Converts the characters of each string to lowercase.
str.upper()         Converts the characters of each string to uppercase.
str.strip()         Removes leading and trailing whitespace from each string.
str.lstrip()        Removes whitespace from the left side (beginning) of each string.
str.rstrip()        Removes whitespace from the right side (end) of each string.
str.find()          Searches for a substring in each string of a Series and returns the lowest index where it occurs (-1 if not found).
str.findall()       Finds all occurrences of a pattern or regex in each string of a Series.
str.rfind()         Searches for a substring in each string of a Series from the right side and returns the highest index where it occurs (-1 if not found).
str.isdigit()       Checks whether all characters in each string of the Series are digits.
str.isalpha()       Checks whether all characters in each string of the Series are alphabetic.
str.isdecimal()     Checks whether all characters in each string are decimal characters.
str.title()         Capitalizes the first letter of every word in each string.
str.len()           Returns the number of characters in each string.
str.replace()       Replaces occurrences of a substring or pattern within each string with another value that the user provides.
str.contains()      Tests whether a pattern or regex is contained within each string of a Series or Index.
str.extract()       Extracts capture groups from the first match of a regular expression pattern.
str.startswith()    Tests whether the start of each string element matches a particular pattern.
str.endswith()      Tests whether the end of each string element matches the given pattern.
str.split()         Splits each string around occurrences of a user-specified delimiter.
str.join()          Joins all elements of each list in a Series with the passed delimiter.
str.cat()           Concatenates the strings in the calling Series, optionally with other strings, using a given separator.
str.repeat()        Repeats each string in the Series the given number of times.
str.get()           Gets the element at the passed position from each element of the Series.
str.partition()     Splits each string at the first occurrence of the separator only, unlike str.split().
str.rpartition()    Splits each string at the last occurrence of the separator only; otherwise it works like str.partition().
str.pad()           Adds padding (whitespace or another character) to every string element in a Series.
str.swapcase()      Swaps the case of each character in every string of the Series.

Operations Using Different Strings Methods

String data often calls for several kinds of manipulation, and some operations come up again and again, for example, changing a string from lowercase to uppercase and vice versa, or finding the length of a string. Pandas provides common string methods for such operations. We first create a DataFrame and then perform some common string operations on it.

Code Example - 1 :
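
A minimal sketch of creating a DataFrame with string columns. The column names and values are hypothetical and chosen only so that the case differences are visible:

```python
import pandas as pd

# Hypothetical sample data with mixed-case strings.
df = pd.DataFrame(
    {
        "Name": ["asha rao", "RAVI KUMAR", "Meera Nair"],
        "City": ["pune", "DELHI", "Kochi"],
    }
)
print(df)

# String methods are reached through the .str accessor of a column (a Series).
print(df["City"].str.upper())
```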

Output:


Code Example - 2 :

Output :

Series.str.upper()

As the name suggests, this method converts each string in the Series to uppercase.

Code Example :
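
A minimal sketch on a hypothetical Series:

```python
import pandas as pd

s = pd.Series(["asha", "Ravi", "MEERA"])  # hypothetical sample values

upper = s.str.upper()
print(upper.tolist())  # ['ASHA', 'RAVI', 'MEERA']
```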

Output :

Series.str.lower()

As the name suggests, this method converts each string in the Series to lowercase.

Code Example :
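
A minimal sketch on a hypothetical Series:

```python
import pandas as pd

s = pd.Series(["asha", "Ravi", "MEERA"])  # hypothetical sample values

lower = s.str.lower()
print(lower.tolist())  # ['asha', 'ravi', 'meera']
```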

Output :

Series.str.isupper()

It checks whether all cased characters in each string of the Series or Index are in uppercase and returns a Boolean value for each element.

Code Example :
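
A minimal sketch on a hypothetical Series; only the fully uppercase value yields True:

```python
import pandas as pd

s = pd.Series(["asha", "Ravi", "MEERA"])  # hypothetical sample values

is_upper = s.str.isupper()
print(is_upper.tolist())  # [False, False, True]
```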

Output :

Series.str.islower()

It checks whether all cased characters in each string of the Series or Index are in lowercase and returns a Boolean value for each element.

Code Example :
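
A minimal sketch on a hypothetical Series; only the fully lowercase value yields True:

```python
import pandas as pd

s = pd.Series(["asha", "Ravi", "MEERA"])  # hypothetical sample values

is_lower = s.str.islower()
print(is_lower.tolist())  # [True, False, False]
```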

Output :

Series.str.len()

It returns the length of each string. An empty string has length 0, while a missing value (NaN) yields NaN.

Code Example :
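
A minimal sketch on a hypothetical Series that contains a normal string, an empty string, and a missing value:

```python
import pandas as pd

s = pd.Series(["pandas", "", None])  # hypothetical sample values

lengths = s.str.len()
# "pandas" has 6 characters, the empty string has 0,
# and the missing value stays missing (NaN).
print(lengths)
```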

Output :

Series.str.title()

It converts the first letter of each word to uppercase and the remaining letters to lowercase, and returns the result. In the output, only the first letter of each word in every data element is in uppercase; the rest are in lowercase.

Code Example :
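
A minimal sketch on a hypothetical Series:

```python
import pandas as pd

s = pd.Series(["asha rao", "RAVI KUMAR"])  # hypothetical sample values

titled = s.str.title()
print(titled.tolist())  # ['Asha Rao', 'Ravi Kumar']
```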

Output :

Various other operations can be performed on string data.

Regex Filtering with Pandas

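A regular expression (regex) is a sequence of characters that defines a search pattern. To filter rows in pandas using a regex, we can use the str.match() method, which checks whether each string matches the pattern starting from its beginning. To match the pattern anywhere in the string, str.contains() can be used instead.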

Code Example :
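
A minimal sketch of filtering rows with str.match(). The DataFrame and the pattern are hypothetical:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"Name": ["Asha", "Ravi", "Anil", "Meera"], "Age": [28, 35, 30, 41]}
)

# str.match() returns a Boolean Series; it is True where the name
# matches the pattern from the start of the string.
mask = df["Name"].str.match(r"A")

# Use the Boolean Series to keep only the matching rows.
filtered = df[mask]
print(filtered)
```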

Output :

Advanced String Methods

Certain string manipulation operations are needed less frequently than changing case or calculating the length of a string. Counting the occurrences of a character and replacing one character with another are examples of such advanced operations. For these, we use advanced string methods like count(), replace(), etc.

Code Example :
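
A minimal sketch of str.replace() on a hypothetical Series. Passing regex=False treats the search text as a literal string rather than a pattern:

```python
import pandas as pd

s = pd.Series(["data-science", "machine-learning"])  # hypothetical sample values

# Replace every hyphen with a space.
replaced = s.str.replace("-", " ", regex=False)
print(replaced.tolist())  # ['data science', 'machine learning']
```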

Output :

Series.str.count()

It returns the number of occurrences of a character or pattern in each element of the Series. In the example given below, we count the occurrences of 'n': only the element at index 1 contains 'n', where it occurs twice, so its count is 2, and the count is 0 for the rest.

Code Example :
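
A minimal sketch of str.count(). The values are hypothetical, chosen so that only the element at index 1 contains 'n':

```python
import pandas as pd

s = pd.Series(["pear", "banana", "plum"])  # hypothetical sample values

# Count how many times 'n' appears in each element.
counts = s.str.count("n")
print(counts.tolist())  # [0, 2, 0]
```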

Output :

Conclusion

In this article, we saw how to work with text data using pandas and looked at the three main methods for converting text data from files into DataFrames.

  • read_csv() :
    This method reads comma-separated files by default and can read other delimited text files when a separator is specified.
  • read_table() :
    This method reads delimited files, with '\t' (tab) as the default separator.
  • read_fwf() :
    This method loads a DataFrame from a file whose columns have fixed widths.

Next, we looked into some common and advanced string methods in pandas.

  • Series.str.upper()/Series.str.lower() :
    These methods change the case of the string to uppercase and lowercase respectively.
  • Series.str.len() :
    This method finds the length of the string data.
  • Series.str.count() :
    This method counts the occurrences of a character or pattern in the given data.
  • We also looked into regex filtering, which helps us filter rows based on a regular expression (a pattern of characters) using the str.match() function.

Working with text data is something you will encounter daily in the field of machine learning or data science. The best way to learn these methods is by getting your hands dirty, since you cannot learn them all just by reading. So pick up any dataset and try these methods on real data to see how they work.