Python: Using Pandas to Read/Write Files

This will look at using Pandas to read, Excel, CSV and Text files into Python and saving the data as a variable.

Tutorial Video

Adding the Pandas Library to your Python Script

In order to use Pandas, we need to import it into our Python Script because we are going to commonly reference the Pandas library when selecting a function within the Pandas library, is recommended to import it as something small for instance:

Python

Reading An Excel File

In this example we will look at an xlsx file. It has a file name Book1 and is saved in the same folder as the Python Script. The Data we are interested in importing is in Sheet1.

We import pandas as pd and then use the read_excel function within the Pandas library. We input the filename and extension and sheet name as the first and second argument.

Python

This can be opened in the Variable Explorer:

File Path

Note the file is in the same folder as the Python Script, meaning only the file name is required.

If the file name is in a different folder the full file name is required, this can be copied and pasted from the address bar:

Python

Note Windows Explorer Gives the File Path as:

Plain Text

On the other hand, Python uses this format for the file path:

Python

The small r in front of the quoted file path will convert the file path given out from Windows Explorer to the relative file path, Python will recognise:

Python

Reading A Comma Separated Value File

We can open a comma (,) separated values (csv) file in Notepad++ and have a look at it.

To understand the file in more detail we can select View ↓ Show Symbol → Show All Characters.

This is the same data as in the Excel Spreadsheet, instead of seeing the grid we see that a , is used to separate 1 (column 1) from 2 (column 2) in the first row. This comma is used as a delimiter or separator. The CR,LF mark after 2 stands for Carriage Return, Line Feed (historically this was a printer to feed the paper through by one line, and return of the ink carriage to the left side of the page). These days in essence an indicator for a New Line.

When importing this data, a second argument sep can be used to indicate the symbol used for a separator. The word delimiter is an alias for separator and carries out the same function when used.

Python

Reading a Tab Delimited Text File

The read_csv command is also used for text files. If the Excel File is saved as Text, (Tab Delimited) opposed to CSV (comma delimited) it can once again be opened in Notepad++ and once again all symbols can be shown.

Python

Reading an Excel File with Column Names

By default Pandas will use the first row as the Column Names. For example in the Excel File we read the these values are taken as 1 and 2 respectively. This should in actual fact be data and not the Column Names:

To get around this we can specify the names of the columns using:

Python

Note how the Column Names are now x and y and not 1 and 2 as before:

Although we have updated the names of the columns, we can see that we have still skipped the data in the first row. To get around this we need another argument:

Python

Now you can see the dimensions are 11 rows by 2 columns opposed to 10 rows by 2 columns:

where we have listed the data in reverse order, reading in the file as before gives:

The Headers (Column Names) are taken as 12 and 22 respectively:

Quite often an Excel File will have Headers, and these will automatically be loaded.

This data is loaded with the Headers without needing to specify the input arguments name or header:

Reading an Excel File with Comments

Sometimes your data files will have comments in them. For example Row 1 and Row 3 in this Excel Spreadsheet. These comments are marked with a #. Note in Python we Index from 0 so the comments are at Row 0 and Row 2 respectively. The Column Names or Header is at Row 1. When importing we can specify to remove the comments using the input argument comment and then specifying the symbol in this case #.

Python

Using the code above gives:

Note the mistake. The Column Names are again 1 and 2 and not x_data and y_data as expected. This is because the way the read works. The original line 0 was a comment and was ignored, so the original line 1 became line 0. The original line 2 was also a comment so is ignored. The original line 3 now becomes line 1 (because two lines of comments were ignored). Specifying the input argument header=1, selects this line as the header opposed to the line 0 which is the correct header.

Python

Updating the code reads the file correctly:

Skip Initial Space and Rows

In the following CSV file, there are 3 blank lines and these are not marked as comments. The lines do not begin with #. These can be skipped using the initial input argument, skipinitialspace and setting it to True

Python

It is also possible to set skiprows to an array. Take into account that we index from 0 in Python whereas notepad++ starts on line 1.

Python

Not a Number (NaN) Values

If we create a value 10/0 which Excel cannot calculate it will return an error #DIV/0! this is not a numeric value and is Not a Number (NaN).

This can be added as an additional argument na_values:

Python

When imported the Not a Number (nan) appears in the field:

This can also be done in a CSV file. We can import using the additional argument as keep_default_na as True:

Python

This loads the value up as nan:

Or if we set keep_default_na as False

Python

We simply get a blank field:

Decimal and Thousands

Some European Countries for example Germany use the semi-colon ; as a separator (delimiter), the comma , as a decimal point and the dot . to indicate thousands.

This CSV file can be read in Notepad++, here we see that the semi-colon ; is used as the separator and the , is used as a decimal point.

When read in, the data looks like this:

If the file format has numbers in this format additional input argument called decimal can be input:

Python

When viewed, the decimal place shows as normal:

Renaming Columns

If we wish to rename the Columns ‘x_data’, ‘y_data’ to ‘t’ and ‘v’ respectively we can use the following command:

Python

Indexing of Columns and Rows

The following data can be imported:

Python

If we examine the data, we can see the Column Names:

These are:

Column NumberColumn Name
0‘Name’
1‘Score 1’
2‘Score 2’
3‘Score 3’

They may be used to select the data in each column, say for example one wants to look at Score 1, they may use:

Python
0    10
1    12
2    12
3    14
Name: Score 1, dtype: int64

You’ll note however that the default Index (Row Names) is 0,1,2,3. If we want to change these, to the value in Name, the data can be imported using an additional argument:

Python

Looking at the imported data we can now see the updated Column Index:

Row NumberRow Name
0‘Philip’
1‘Jess’
2‘Maja’
3‘Jordan’

A Row may be selected by using its name for example:

Python
Score 1    10
Score 2     9
Score 3    15
Name: Philip, dtype: int64

And a single point may be selected using:

Python
12

It is also possible to index calling both the columns and rows by their numbers, instead of using location loc we use integer location iloc. In the case above, ‘Jess’ is in the 1st row and Score 1 is in the 0th row. Thus the following gives the same result:

Python
12

Saving to Excel or CSV

The following command can be used to save to the xlsx format or csv format:

Python
Python

The files should be created:

Here is the xlsx Excel File for Example:

Advertisements

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.