This will look at using Pandas to read, Excel, CSV and Text files into Python and saving the data as a variable.
Table of contents
- Adding the Pandas Library to your Python Script
- Reading An Excel File
- File Path
- Reading A Comma Separated Value File
- Reading a Tab Delimited Text File
- Reading an Excel File with Column Names
- Reading an Excel File with Comments
- Skip Initial Space and Rows
- Not a Number (NaN) Values
- Decimal and Thousands
- Renaming Columns
- Indexing of Columns and Rows
- Saving to Excel or CSV
Adding the Pandas Library to your Python Script
In order to use Pandas, we need to import it into our Python Script because we are going to commonly reference the Pandas library when selecting a function within the Pandas library, is recommended to import it as something small for instance:
Reading An Excel File
In this example we will look at an xlsx file. It has a file name Book1 and is saved in the same folder as the Python Script. The Data we are interested in importing is in Sheet1.
We import pandas as pd and then use the read_excel function within the Pandas library. We input the filename and extension and sheet name as the first and second argument.
This can be opened in the Variable Explorer:
Note the file is in the same folder as the Python Script, meaning only the file name is required.
If the file name is in a different folder the full file name is required, this can be copied and pasted from the address bar:
Note Windows Explorer Gives the File Path as:
On the other hand, Python uses this format for the file path:
The small r in front of the quoted file path will convert the file path given out from Windows Explorer to the relative file path, Python will recognise:
Reading A Comma Separated Value File
We can open a comma (,) separated values (csv) file in Notepad++ and have a look at it.
To understand the file in more detail we can select View ↓ Show Symbol → Show All Characters.
This is the same data as in the Excel Spreadsheet, instead of seeing the grid we see that a , is used to separate 1 (column 1) from 2 (column 2) in the first row. This comma is used as a delimiter or separator. The CR,LF mark after 2 stands for Carriage Return, Line Feed (historically this was a printer to feed the paper through by one line, and return of the ink carriage to the left side of the page). These days in essence an indicator for a New Line.
When importing this data, a second argument sep can be used to indicate the symbol used for a separator. The word delimiter is an alias for separator and carries out the same function when used.
Reading a Tab Delimited Text File
The read_csv command is also used for text files. If the Excel File is saved as Text, (Tab Delimited) opposed to CSV (comma delimited) it can once again be opened in Notepad++ and once again all symbols can be shown.
Reading an Excel File with Column Names
By default Pandas will use the first row as the Column Names. For example in the Excel File we read the these values are taken as 1 and 2 respectively. This should in actual fact be data and not the Column Names:
To get around this we can specify the names of the columns using:
Note how the Column Names are now x and y and not 1 and 2 as before:
Although we have updated the names of the columns, we can see that we have still skipped the data in the first row. To get around this we need another argument:
Now you can see the dimensions are 11 rows by 2 columns opposed to 10 rows by 2 columns:
where we have listed the data in reverse order, reading in the file as before gives:
The Headers (Column Names) are taken as 12 and 22 respectively:
Quite often an Excel File will have Headers, and these will automatically be loaded.
This data is loaded with the Headers without needing to specify the input arguments name or header:
Reading an Excel File with Comments
Sometimes your data files will have comments in them. For example Row 1 and Row 3 in this Excel Spreadsheet. These comments are marked with a #. Note in Python we Index from 0 so the comments are at Row 0 and Row 2 respectively. The Column Names or Header is at Row 1. When importing we can specify to remove the comments using the input argument comment and then specifying the symbol in this case #.
Using the code above gives:
Note the mistake. The Column Names are again 1 and 2 and not x_data and y_data as expected. This is because the way the read works. The original line 0 was a comment and was ignored, so the original line 1 became line 0. The original line 2 was also a comment so is ignored. The original line 3 now becomes line 1 (because two lines of comments were ignored). Specifying the input argument header=1, selects this line as the header opposed to the line 0 which is the correct header.
Updating the code reads the file correctly:
Skip Initial Space and Rows
In the following CSV file, there are 3 blank lines and these are not marked as comments. The lines do not begin with #. These can be skipped using the initial input argument, skipinitialspace and setting it to True
It is also possible to set skiprows to an array. Take into account that we index from 0 in Python whereas notepad++ starts on line 1.
Not a Number (NaN) Values
If we create a value 10/0 which Excel cannot calculate it will return an error #DIV/0! this is not a numeric value and is Not a Number (NaN).
This can be added as an additional argument na_values:
When imported the Not a Number (nan) appears in the field:
This can also be done in a CSV file. We can import using the additional argument as keep_default_na as True:
This loads the value up as nan:
Or if we set keep_default_na as False
We simply get a blank field:
Decimal and Thousands
Some European Countries for example Germany use the semi-colon ; as a separator (delimiter), the comma , as a decimal point and the dot . to indicate thousands.
This CSV file can be read in Notepad++, here we see that the semi-colon ; is used as the separator and the , is used as a decimal point.
When read in, the data looks like this:
If the file format has numbers in this format additional input argument called decimal can be input:
When viewed, the decimal place shows as normal:
If we wish to rename the Columns ‘x_data’, ‘y_data’ to ‘t’ and ‘v’ respectively we can use the following command:
Indexing of Columns and Rows
The following data can be imported:
If we examine the data, we can see the Column Names:
|Column Number||Column Name|
They may be used to select the data in each column, say for example one wants to look at Score 1, they may use:
0 10 1 12 2 12 3 14 Name: Score 1, dtype: int64
You’ll note however that the default Index (Row Names) is 0,1,2,3. If we want to change these, to the value in Name, the data can be imported using an additional argument:
Looking at the imported data we can now see the updated Column Index:
|Row Number||Row Name|
A Row may be selected by using its name for example:
Score 1 10 Score 2 9 Score 3 15 Name: Philip, dtype: int64
And a single point may be selected using:
It is also possible to index calling both the columns and rows by their numbers, instead of using location loc we use integer location iloc. In the case above, ‘Jess’ is in the 1st row and Score 1 is in the 0th row. Thus the following gives the same result:
Saving to Excel or CSV
The following command can be used to save to the xlsx format or csv format:
The files should be created:
Here is the xlsx Excel File for Example: