I/O with NumPy

Overview

The term I/O (input/output) describes any program, operation, or device that transfers data between a computer and a peripheral device. Anyone who has used a flash drive to copy games from one computer to another has seen the idea in action: every transfer is an output from one device and an input into another. This article covers how to perform I/O with NumPy, along with several NumPy functions that make our lives easier as software developers.

Introduction

Data and programs are "written" into and "read" out of files, the basic unit of persistent storage on a computer. We use files to keep data on a disk so that it stays available for future use.

Taking a trip down memory lane, all of us remember the era of CDs (Compact Discs). If we had a film to transfer from one computer to another, we would write the file to a CD and then give it to our friends.

Throughout this article, we will learn how to work with files: how to store our data in them and how to retrieve it.

Basic I/O with NumPy Functions

  1. load() and save() Saving and loading arrays are the most basic I/O tasks in NumPy, and their applications are everywhere. For example, when deploying a machine learning model with Flask, the model is typically saved to disk (often with pickle) and loaded again when the application starts; save() and load() do the same job for NumPy arrays.
  • save() function: save() writes a single array to disk in NumPy's binary .npy format. An image held in memory as a NumPy array, for instance, can be saved this way.

  • load() function: load() reads an array from a .npy file on disk back into our workspace. It also reads .npz archives and, when called with allow_pickle=True, pickled object arrays.
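The original listings for save() and load() are not available, so the following is only a minimal sketch of the round trip described above. The small uint8 array stands in for image data, and the file name image.npy and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# A tiny array standing in for image data (assumed 2x3 grayscale pixels).
image = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "image.npy")  # hypothetical file name

    # save() writes the array in NumPy's binary .npy format.
    np.save(path, image)

    # load() reads it back with the same dtype and shape.
    restored = np.load(path)

print(restored.dtype, restored.shape)
print(np.array_equal(image, restored))
```

Because the .npy format stores the dtype and shape alongside the data, nothing about the array needs to be specified again when loading.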

  2. loadtxt() and savetxt() Much of the data that programmers receive comes as plain-text files, such as CSV exports from spreadsheets. The loadtxt() and savetxt() functions load and save text files, respectively.
  • loadtxt() function: loadtxt() reads a text file (for example, a .txt or .csv file) from disk into an array. Every row must contain the same number of values.
  • savetxt() function: savetxt() writes an array from our workspace to a text file on disk.
  3. savez() savez() saves multiple arrays in a single file in the .npz format. The resulting archive is uncompressed; savez_compressed() produces a compressed one.
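As a rough illustration of the text and archive functions in this list, the sketch below writes an array with savetxt(), reads it back with loadtxt(), and then stores two arrays in one .npz file with savez(). The file names, the keyword names table and ids, and the "%.2f" format are assumptions chosen for the example.

```python
import os
import tempfile

import numpy as np

table = np.array([[1.5, 2.0], [3.25, 4.0]])
ids = np.arange(3)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Text round trip: one row per line, values separated by commas.
    txt_path = os.path.join(tmp_dir, "table.txt")  # hypothetical file name
    np.savetxt(txt_path, table, delimiter=",", fmt="%.2f")
    table_back = np.loadtxt(txt_path, delimiter=",")

    # Several arrays in one uncompressed .npz archive; the keyword
    # names become the keys used to look the arrays up again.
    npz_path = os.path.join(tmp_dir, "arrays.npz")  # hypothetical file name
    np.savez(npz_path, table=table, ids=ids)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        ids_back = archive["ids"]

print(table_back)
print(names)
print(ids_back)
```

Text files are easy to inspect and share but store only as many digits as the format string allows, whereas .npy and .npz files preserve the arrays exactly.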

Importing Data Using Genfromtxt

The genfromtxt() function in NumPy is very similar to the loadtxt() function; the main difference is that genfromtxt() lets us handle missing information in whatever way we choose.

  • How do we define the input in genfromtxt()? genfromtxt() has many parameters, but the input (the source of the data) is the most important one and the only mandatory one.
    The input can be a file name, an open file-like object, a list of strings, or a generator of lines. When a file name is given, genfromtxt() opens the file itself and recognizes compressed archives by their extension: a name ending in .gz is treated as a gzip archive, and a name ending in .bz2 as a bzip2 archive.
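A short sketch of three of these input sources follows. The sample text, the temporary directory, and the file name data.txt.gz are assumptions for illustration; all three calls should yield the same 2x3 float array.

```python
import gzip
import os
import tempfile
from io import StringIO

import numpy as np

text = "1 2 3\n4 5 6"

# 1. An open file-like object (StringIO wraps a string as a file).
from_file_like = np.genfromtxt(StringIO(text))

# 2. A list of strings, one string per line.
from_lines = np.genfromtxt(["1 2 3", "4 5 6"])

# 3. A file name; the .gz extension makes genfromtxt() decompress it.
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "data.txt.gz")  # hypothetical file name
    with gzip.open(gz_path, "wt") as handle:
        handle.write(text)
    from_gzip = np.genfromtxt(gz_path)

print(from_file_like)
```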

  • How do we split the lines into columns? Once the file has been specified and opened for reading, genfromtxt() splits each line into a sequence of strings. The delimiter parameter tells genfromtxt() how to split the data. To split a .csv (comma-separated values) file, we use a comma (,) as the delimiter; for semicolon-separated data, we use a semicolon (;). If no delimiter is given, lines are split on whitespace.
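The delimiter parameter can be sketched as follows; the two sample strings are made up for the example and are read through StringIO rather than from real files.

```python
from io import StringIO

import numpy as np

comma_data = StringIO("1,2,3\n4,5,6")
semicolon_data = StringIO("1;2;3\n4;5;6")

# The same table, split on two different separators.
by_comma = np.genfromtxt(comma_data, delimiter=",")
by_semicolon = np.genfromtxt(semicolon_data, delimiter=";")

print(by_comma)
print(np.array_equal(by_comma, by_semicolon))
```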

  • How do we choose the columns? When we work on a dataset, we often find that we don't need every column for our analysis, so we want to import only the columns that matter. To do this, we use the usecols parameter, which specifies the columns to import. It takes a single integer or a sequence of integers representing the column indices; negative indices count from the last column.
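A minimal sketch of usecols, using made-up data with four columns:

```python
from io import StringIO

import numpy as np

data = "1 2 3 4\n5 6 7 8"

# Keep only the first and the last column (-1 counts from the end).
first_and_last = np.genfromtxt(StringIO(data), usecols=(0, -1))

# A single integer selects one column.
second_only = np.genfromtxt(StringIO(data), usecols=1)

print(first_and_last)
print(second_only)
```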

  • How do we choose the data types? To control the data type of the columns, we use the dtype parameter. Among the values that dtype accepts are:
    • a single data type, such as float or int, applied to every column;
    • a comma-separated string of type codes, such as "i4,f8,U3" (a 4-byte integer, an 8-byte float, and a 3-character string), with one code per column;
    • a sequence of data types, or a sequence of (name, type) tuples;
    • None, in which case genfromtxt() infers the type of each column from the data.

Note: float is the default data type for genfromtxt().
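The sketch below contrasts an explicit comma-separated dtype string with dtype=None. The sample data is invented, and encoding="utf-8" is passed explicitly as an assumption so that the string column comes back as text on both older and newer NumPy versions.

```python
from io import StringIO

import numpy as np

data = "1 2.5 abc\n4 5.5 def"

# One type code per column: 4-byte int, 8-byte float, 3-character string.
typed = np.genfromtxt(StringIO(data), dtype="i4,f8,U3", encoding="utf-8")

# dtype=None asks genfromtxt() to infer each column's type from the data.
inferred = np.genfromtxt(StringIO(data), dtype=None, encoding="utf-8")

# Unnamed fields of a structured array are called f0, f1, f2, ...
print(typed.dtype)
print(inferred.dtype)
print(typed["f2"])
```

In both cases the result is a one-dimensional structured array with one record per line, rather than a two-dimensional float array.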

  • How do we set names? When working with tables, it is good practice to give the columns specific names. Named columns keep the workspace and dataset cleaner and make the data easier to work with. The names parameter accepts a comma-separated string or a sequence of strings.

    When names is given and dtype is left at its default, genfromtxt() returns a structured array in which every field has the data type "f8", an 8-byte (64-bit) float. This confirms that the default data type for genfromtxt() is float. Examples of this kind usually read their data through StringIO, a class in Python's io module that wraps a string in a file-like object so that genfromtxt() can treat it as a file.
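Since the original listing is missing, here is a small sketch of the names parameter under the assumptions just described: the data is an invented three-column table and the names a, b, and c are arbitrary.

```python
from io import StringIO

import numpy as np

data = StringIO("1 2 3\n4 5 6")

# names turns the result into a structured array with named fields.
table = np.genfromtxt(data, names="a, b, c")

print(table.dtype)   # every field is an 8-byte float
print(table["b"])    # columns can now be selected by name
```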

  • How do we tweak our conversions? When genfromtxt() parses strings, we want the returned values to have the data type we expect. By default every value is converted to float, which works for plain numbers but not for values with extra formatting. If we are working with time-series data, for example, we may need dates in a specific format (e.g. DD/MM/YYYY) to be parsed correctly. Tweaks like this are made with the converters parameter of genfromtxt(), which maps a column index or name to a function that converts that column's strings.

Without a converter, values such as 2.3% and 78.9% cannot be converted to float, so they appear as nan in the output. A converter that strips the percent sign before the conversion solves this problem.
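The sketch below reproduces that situation with made-up data containing the two percentages. The helper percent_to_fraction is a hypothetical name introduced for this example, and encoding="utf-8" is passed as an assumption so that the converter receives str rather than bytes on every NumPy version.

```python
from io import StringIO

import numpy as np

data = "1, 2.3%, 45.\n6, 78.9%, 0"
names = ("i", "p", "n")

# Without a converter, the percentage column cannot be parsed as float.
without = np.genfromtxt(StringIO(data), delimiter=",", names=names,
                        encoding="utf-8")


def percent_to_fraction(text):
    """Hypothetical helper: turn a string like ' 2.3%' into 0.023."""
    return float(text.strip().strip("%")) / 100.0


# converters maps a column index (here column 1) to a conversion function.
with_converter = np.genfromtxt(StringIO(data), delimiter=",", names=names,
                               converters={1: percent_to_fraction},
                               encoding="utf-8")

print(without["p"])
print(with_converter["p"])
```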

  • Additional parameters in genfromtxt(): Dealing with missing values is one of the main problems faced by people who work with data in Python, and data cleaning is one of the most important steps in a machine learning project. genfromtxt() has two parameters that let us recognize missing data and then fill it in:
    • missing_values: specifies which strings mark missing data. By default, only an empty string is treated as missing; other markers, such as N/A or ???, must be listed explicitly.
    • filling_values: specifies the values that replace missing entries. If it is not given, a default based on the column's data type is used: False for bool, -1 for int, np.nan for float, np.nan+0j for complex, and '???' for string.
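A sketch of both parameters working together follows. It is modelled on the example in the NumPy user guide; the data, the column names, and the fill values 0 and -999 are illustrative choices.

```python
from io import StringIO

import numpy as np

data = "N/A, 2, 3\n4, ,???"

result = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    dtype=int,
    names="a,b,c",
    # Which string marks a missing entry, per column (by index or name).
    missing_values={0: "N/A", "b": " ", 2: "???"},
    # What to put in place of a missing entry, per column.
    filling_values={0: 0, "b": 0, 2: -999},
)

print(result)
```

Both parameters also accept a single value or a sequence, in which case the setting applies to all columns or to the columns in order.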

Conclusion

Let's recap the things and concepts we learnt in this article:

  • I/O describes any program, operation, or device that transfers data between a computer and a peripheral device.
  • We covered save(), which writes an array to a file, and load(), which reads it back from the local system.
  • We also covered savetxt() (saving a text file), loadtxt() (loading a text file from the local system), and savez() (storing several arrays in one file).
  • We covered the genfromtxt() function, which is similar to loadtxt() but lets us handle missing data in whatever way we choose.
  • Apart from the input argument of genfromtxt(), we covered additional parameters such as delimiter, usecols, dtype, names, converters, missing_values, and filling_values, which make the data-cleaning process easier.