Python: Using Pandas to Read/Write Files

This page is obsolete, I wrote it while I was just starting to learn python. I have wrote a much better set of notes on pandas available here:

An Introduction to Scientific Programming with Python

This will look at using Pandas to read, Excel, CSV and Text files into Python and saving the data as a variable.

Table of contents

Tutorial Video
Adding the Pandas Library to your Python Script
Reading An Excel File

File Path
Reading A Comma Separated Value File
Reading a Tab Delimited Text File

Reading an Excel File with Column Names
Reading an Excel File with Comments
Skip Initial Space and Rows

Not a Number (NaN) Values
Decimal and Thousands
Renaming Columns

Indexing of Columns and Rows
Saving to Excel or CSV

Tutorial Video

Adding the Pandas Library to your Python Script

In order to use Pandas, we need to import it into our Python Script because we are going to commonly reference the Pandas library when selecting a function within the Pandas library, is recommended to import it as something small for instance:

import pandas as pd

Reading An Excel File

In this example we will look at an xlsx file. It has a file name Book1 and is saved in the same folder as the Python Script. The Data we are interested in importing is in Sheet1.

We import pandas as pd and then use the read_excel function within the Pandas library. We input the filename and extension and sheet name as the first and second argument.

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel('Book1.xlsx','Sheet1')

This can be opened in the Variable Explorer:

File Path

Note the file is in the same folder as the Python Script, meaning only the file name is required.

If the file name is in a different folder the full file name is required, this can be copied and pasted from the address bar:

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel(r'C:\Users\Phili\OneDrive\Documents\Python\Book1.xlsx','Sheet1')

Note Windows Explorer Gives the File Path as:

C:\Users\Phili\OneDrive\Documents\Python\Book1.xlsx

On the other hand, Python uses this format for the file path:

C://Users//Phili//OneDrive//Documents//Python//Book1.xlsx

The small r in front of the quoted file path will convert the file path given out from Windows Explorer to the relative file path, Python will recognise:

r'C:\Users\Phili\OneDrive\Documents\Python\Book1.xlsx'

Reading A Comma Separated Value File

We can open a comma (,) separated values (csv) file in Notepad++ and have a look at it.

To understand the file in more detail we can select View ↓ Show Symbol → Show All Characters.

This is the same data as in the Excel Spreadsheet, instead of seeing the grid we see that a , is used to separate 1 (column 1) from 2 (column 2) in the first row. This comma is used as a delimiter or separator. The CR,LF mark after 2 stands for Carriage Return, Line Feed (historically this was a printer to feed the paper through by one line, and return of the ink carriage to the left side of the page). These days in essence an indicator for a New Line.

When importing this data, a second argument sep can be used to indicate the symbol used for a separator. The word delimiter is an alias for separator and carries out the same function when used.

import numpy as np
import pandas as pd
# In a comma seperated CSV file the comma is the delimiter. 
# It can be specified as an additional argument.
csv_dataframe=pd.read_csv('Book1.csv', sep=',')
# delimiter acts as an alias for sep
csv_dataframe=pd.read_csv('Book1.csv', delimiter=',')

Reading a Tab Delimited Text File

The read_csv command is also used for text files. If the Excel File is saved as Text, (Tab Delimited) opposed to CSV (comma delimited) it can once again be opened in Notepad++ and once again all symbols can be shown.

import numpy as np
import pandas as pd
# In a text tab /t seperated file the tab \t is the delimiter. 
# It can be specified as an additional argument.
txt_dataframe=pd.read_csv('Book1.txt', sep='\t')

Reading an Excel File with Column Names

By default Pandas will use the first row as the Column Names. For example in the Excel File we read the these values are taken as 1 and 2 respectively. This should in actual fact be data and not the Column Names:

To get around this we can specify the names of the columns using:

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel('Book1.xlsx','Sheet1',names=['x','y'])

Note how the Column Names are now x and y and not 1 and 2 as before:

Although we have updated the names of the columns, we can see that we have still skipped the data in the first row. To get around this we need another argument:

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel('Book1.xlsx','Sheet1',header=None,names=['x','y'])

Now you can see the dimensions are 11 rows by 2 columns opposed to 10 rows by 2 columns:

where we have listed the data in reverse order, reading in the file as before gives:

The Headers (Column Names) are taken as 12 and 22 respectively:

Quite often an Excel File will have Headers, and these will automatically be loaded.

This data is loaded with the Headers without needing to specify the input arguments name or header:

Reading an Excel File with Comments

Sometimes your data files will have comments in them. For example Row 1 and Row 3 in this Excel Spreadsheet. These comments are marked with a #. Note in Python we Index from 0 so the comments are at Row 0 and Row 2 respectively. The Column Names or Header is at Row 1. When importing we can specify to remove the comments using the input argument comment and then specifying the symbol in this case #.

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel('Book3.xlsx','Sheet1',comment='#',header=1)

Using the code above gives:

Note the mistake. The Column Names are again 1 and 2 and not x_data and y_data as expected. This is because the way the read works. The original line 0 was a comment and was ignored, so the original line 1 became line 0. The original line 2 was also a comment so is ignored. The original line 3 now becomes line 1 (because two lines of comments were ignored). Specifying the input argument header=1, selects this line as the header opposed to the line 0 which is the correct header.

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel('Book3.xlsx','Sheet1',comment='#',header=0)

Updating the code reads the file correctly:

Skip Initial Space and Rows

In the following CSV file, there are 3 blank lines and these are not marked as comments. The lines do not begin with #. These can be skipped using the initial input argument, skipinitialspace and setting it to True

import numpy as np
import pandas as pd
csv_dataframe=pd.read_csv('Book4.csv',sep=',',comment='#',skipinitialspace=True)

It is also possible to set skiprows to an array. Take into account that we index from 0 in Python whereas notepad++ starts on line 1.

import numpy as np
import pandas as pd
csv_dataframe=pd.read_csv('Book4.csv',sep=',',comment='#',skiprows=[0,1,2])

Not a Number (NaN) Values

If we create a value 10/0 which Excel cannot calculate it will return an error #DIV/0! this is not a numeric value and is Not a Number (NaN).

This can be added as an additional argument na_values:

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_csv('Book5.csv',comment='#',header=0,na_values='#DIV/0!')

When imported the Not a Number (nan) appears in the field:

This can also be done in a CSV file. We can import using the additional argument as keep_default_na as True:

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_csv('Book5.csv',sep=',',comment='#',header=1,na_values='#DIV/0!',keep_default_na=True)

This loads the value up as nan:

Or if we set keep_default_na as False

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_csv('Book5.csv',sep=',',comment='#',header=1,na_values='#DIV/0!',keep_default_na=False)

We simply get a blank field:

Decimal and Thousands

Some European Countries for example Germany use the semi-colon ; as a separator (delimiter), the comma , as a decimal point and the dot . to indicate thousands.

This CSV file can be read in Notepad++, here we see that the semi-colon ; is used as the separator and the , is used as a decimal point.

When read in, the data looks like this:

If the file format has numbers in this format additional input argument called decimal can be input:

import numpy as np
import pandas as pd
csv_dataframe=pd.read_csv('Mappe1.csv',sep=';',decimal=',',thousands='.',Header=[0])

When viewed, the decimal place shows as normal:

Renaming Columns

If we wish to rename the Columns 'x_data', 'y_data' to 't' and 'v' respectively we can use the following command:

xlsx_dataframe.columns=['t','v']

Indexing of Columns and Rows

The following data can be imported:

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel('Book6.xlsx','Sheet1',header=0)

If we examine the data, we can see the Column Names:

These are:

Column Number	Column Name
0	'Name'
1	'Score 1'
2	'Score 2'
3	'Score 3'

They may be used to select the data in each column, say for example one wants to look at Score 1, they may use:

print(xlsx_dataframe['Score 1'])

0    10
1    12
2    12
3    14
Name: Score 1, dtype: int64

You'll note however that the default Index (Row Names) is 0,1,2,3. If we want to change these, to the value in Name, the data can be imported using an additional argument:

import numpy as np
import pandas as pd
xlsx_dataframe=pd.read_excel('Book6.xlsx',header=0,index_col='Name')

Looking at the imported data we can now see the updated Column Index:

Row Number	Row Name
0	'Philip'
1	'Jess'
2	'Maja'
3	'Jordan'

A Row may be selected by using its name for example:

print(xlsx_dataframe.loc['Philip'])

Score 1    10
Score 2     9
Score 3    15
Name: Philip, dtype: int64

And a single point may be selected using:

print(xlsx_dataframe.loc['Score 2','Jess'])

It is also possible to index calling both the columns and rows by their numbers, instead of using location loc we use integer location iloc. In the case above, 'Jess' is in the 1st row and Score 1 is in the 0th row. Thus the following gives the same result:

print(xlsx_dataframe.iloc[1,0])

Saving to Excel or CSV

The following command can be used to save to the xlsx format or csv format:

xlsx_dataframe.to_excel('Export1.xlsx')

xlsx_dataframe.to_csv('Export1.csv')

The files should be created:

Here is the xlsx Excel File for Example:

Tutorial Video

Adding the Pandas Library to your Python Script

Reading An Excel File

File Path

Reading A Comma Separated Value File

Reading a Tab Delimited Text File

Reading an Excel File with Column Names

Reading an Excel File with Comments

Skip Initial Space and Rows

Not a Number (NaN) Values

Decimal and Thousands

Renaming Columns

Indexing of Columns and Rows

Saving to Excel or CSV

Like this:

Leave a ReplyCancel reply

Tutorial Video

Adding the Pandas Library to your Python Script

Reading An Excel File

File Path

Reading A Comma Separated Value File

Reading a Tab Delimited Text File

Reading an Excel File with Column Names

Reading an Excel File with Comments

Skip Initial Space and Rows

Not a Number (NaN) Values

Decimal and Thousands

Renaming Columns

Indexing of Columns and Rows

Saving to Excel or CSV

Share this:

Like this:

Leave a ReplyCancel reply