Python and Data Analysis Library (pandas)

Introduction

Spyder is a Python Integrated Development Environment (IDE) which has a similar user interface to Matrix Laboratory (MATLAB), a commercial product commonly used in the physical sciences and engineering fields. Python and the Spyder IDE are open source and can be freely downloaded as part of the Anaconda Python Distribution.

This particular guide uses the Python and Data Analysis Library (which is abbreviated using lower case pandas) to organize and manipulate data in a dataframe (which can be thought of as equivalent to an Excel spreadsheet). The Spyder 4 IDE will be used in this example as it has a variable explorer which makes it easier to visualize the example data.

Python with the numeric python library (numpy), the plotting library (matplotlib) and the Python and data analysis library (pandas) can be used to replace the commercial product MATLAB for most applications.

Tutorial Video

Installation

If you haven't already installed Python and Spyder 4 using the Anaconda installation, see my separate installation guide.

Prerequisites

It is not recommended to jump straight into pandas until you are familiar with the Core Python programming language and the Numeric Python Library (numpy). I have guides on these here:

Importing the pandas Library

The pandas library is one of the most commonly used Python libraries, particularly in the field of data science. As a consequence it is commonly imported using a two letter abbreviation, similar to the numpy library.

import numpy as np
import pandas as pd

The DataFrame Class

Vectors

Suppose we start out with the following data as numpy array vectors.

import numpy as np
x=np.array([0,1,2,3,4])
y1=np.array([0,2,4,6,8])
y2=np.array([0,1.5,3,4.5,6.0])

Matrix

In this case x is the independent axis and y1 and y2 are measurements taken at the values of x. It may be more convenient to group it together as a numpy array matrix. To do this we can explicitly set each numpy array vector to a column and concatenate them:

# Prerequisites
import numpy as np
# Create Vectors
x=np.array([0,1,2,3,4])
y1=np.array([0,2,4,6,8])
y2=np.array([0,1.5,3,4.5,6.0])
# Assign Vectors to Columns
x=np.reshape(x,(len(x),1))
y1=np.reshape(y1,(len(y1),1))
y2=np.reshape(y2,(len(y2),1))
# Concatenate the Columns to form a Matrix
matrix=np.concatenate([x,y1,y2],axis=1)

Note however that when we do so, the original names of each data series (column), in this case x, y1 and y2, are lost.

DataFrame

We can instead create another data structure, known as a dataframe, which groups the data together like a matrix but assigns a name to each column.

To create a dataframe we need to import the pandas library. Note that it is sometimes useful to run this line in the console so that Spyder has loaded pandas and can look up its help.

import pandas as pd

To create a dataframe we need to use the DataFrame class of the pandas library.

Note the CamelCaseCapitalization of the class DataFrame. This convention is used for third party classes in the Python programming language.

This naming convention distinguishes them from classes inbuilt into the Core Python programming language which are lower case and displayed in purple.

Although numpy is also a third-party library it does not use the CamelCaseCapitalization convention for naming classes. The numpy library is the most commonly used Python library and is treated in many cases as though it were inbuilt.

We need to provide 2 keyword input arguments to the DataFrame class:

  • the data keyword input argument which in this case are a matrix of numbers
  • the columns keyword input argument which is a list of strings. The number of strings in the list must match the number of columns in the matrix.

Recall that keyword input arguments have a default value; if they are not provided, the default value will be used. The index keyword input argument is set to a default of None, meaning default integer numeric values will be assigned to the rows. If columns wasn't assigned, it would also default to numeric integers.
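
For illustration, a minimal sketch reusing the matrix created above (the name dataframe_default is just for this example):

dataframe_default=pd.DataFrame(data=matrix)
# Both the index and the columns fall back to default integer labels 0,1,2,...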

The instance of the DataFrame class is an object (variable) and is assigned an object name. In this case we use the lower case variable name dataframe:

import pandas as pd
dataframe=pd.DataFrame(data=matrix,columns=['x','y1','y2'])

The dataframe object created has an index of [0,1,2,3,4] (default) and 3 columns called ['x','y1','y2'] (provided).

Note that the first keyword input argument data can also be used as a positional input argument, so the following line of code is equivalent:

dataframe=pd.DataFrame(matrix,columns=['x','y1','y2'])

The Index (row names) of this dataframe is numerical with values [0,1,2,3,4] but can be labelled by providing a list for the keyword input argument index. The index is sometimes referred to as a row or observation.

dataframe=pd.DataFrame(matrix,
                       columns=['x','y1','y2'],
                       index=['r0','r1','r2','r3','r4'])

Reading and Writing Excel Files

Comma Separated Values (CSV) and Tab Delimited (TXT) Files

A dataframe is analogous to a Microsoft Excel spreadsheet. We will start off with a spreadsheet created in Microsoft Excel. In this file we have two columns "A" and "B" which contain numeric values.

Let us save this to a comma separated values (.csv) file:

Now let's use an additional program, Notepad++, to open this saved csv file. In this file we will select View→Show Symbol→Show All Characters to show what are usually hidden formatting/punctuation characters, in order to help us better understand how data in the file is stored. Here we can see that a comma , acts as a delimiter (separating each value in Column A from the corresponding value in Column B). At the end of each line is CR and LF. These stand for Carriage Return and Line Feed respectively. We need both as CR returns us to the left-hand side of the file and LF brings us onto a new line. The name of these punctuation marks originates from typewriters. To use a typewriter after each line, the carriage had to be returned to the left-hand side of the piece of paper (carriage return) and the piece of paper then had to be slid up by the space of a line (line feed).
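
We can also inspect these hidden characters from Python itself by reading the file in binary mode (a sketch; this assumes the file was saved as Book1.csv, and the exact bytes shown are hypothetical and depend on the file's contents):

with open('Book1.csv','rb') as file:
    print(file.read())
# Hypothetical output: b'1,2\r\n3,4\r\n' where \r is CR and \n is LF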

We can alternatively save the file as a Text (Tab delimited) .txt file.

We can now open this up in Notepad++ and the only difference we will see between the two files is the delimiter. In the .csv file the comma , is used as the delimiter and in the text file the tab is used as the delimiter.

Again, punctuation marks like the tab are usually hidden but we have enabled them in this case in order to better understand what is going on. We can also save the file as Microsoft's own .xlsx format. Note in this format that our data is contained in an Excel workbook that has Sheet1.

When we try and open this file up in Notepad++ we won't be able to intuitively understand what is going on and it will look like gibberish. This is because this file format has a larger number of features and each character in the workbook will have its own size, font, font color etc. In this file format many more formatting characters are used and, like CR and LF, these are normally hidden from the user.

The pandas library has a function called pd.read_csv() which can be used to read in csv and txt files. Although we cannot understand the xlsx file by use of Notepad++, the pandas library also contains a similar function pd.read_excel() with more under the hood which can be used with the xlsx file format.

Reading a CSV File

It is easier to access files in the same folder as the python script as only the file name is required. Select File→Save As… and save the python script in the same folder as the CSV file:

In this case the folder is:

C:\Users\Philip\Python Scripts

Select Save.

When the script is run, the Files pane will show the folder of the script file. In this case the Book1.csv file displays.

We can use the function read_csv to read in a csv file. If we type this function with open parenthesis we can see the positional input argument filepath and several keyword input arguments. We need to provide the positional input argument and we can leave all the keyword input arguments at their default values.

As the file is in the same folder as the script, the file name (including extension) can be provided as a string:

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
csv_file=pd.read_csv('Book1.csv')

It is also possible to use the full file path, which we can get by right clicking the file and copying it.

We can enclose it in ' ' to make a string. However when we try to run the code we get an error; this is because the \ is a special escape character in Python.

To get a \ we need to either type in \\

folder='C:\\Users\\Phili\\Documents\\Python Scripts'

Alternatively we can use a raw string:

folder=r'C:\Users\Phili\Documents\Python Scripts'

The value displays in the variable explorer:

Now we can use the string with the folder name and concatenate another \ as well as the filename including extension.

folder=r'C:\Users\Phili\Documents\Python Scripts'
file=folder+'\\'+'Book1.csv'
csv_file=pd.read_csv(file)

Notice that the folder location is specific to my user account C:\Users\Phili\. In Windows this is known as an environment variable and can be accessed in Windows Explorer using %UserProfile%.

Unfortunately Python strings don't expand the Windows 10 environment variables natively and we must use the os library to access additional operating system functions.

The os library has a dictionary attribute environ which can be used to access the Windows 10 environment variables.

Recall a dictionary has the form:

d1={'key0':'value0','key1':'value1','key2':'value2'}

And we can index with square brackets by using the key:

d1['key0']

We can therefore use the key 'UserProfile' to return the location of the user profile:
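
import os
os.environ['UserProfile']   # returns the user profile path, e.g. 'C:\\Users\\Phili'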

Therefore we can also use:

import os
userprofile=os.environ['UserProfile']
folder=userprofile+'\\'+r'Documents\Python Scripts'
file=folder+'\\'+'Book1.csv'
csv_file=pd.read_csv(file)

For simplicity we will take advantage of the fact the file is in the same folder as the python script.

If we look closer we'll notice that the first line is missing in the csv_file dataframe; this is because its values have been taken as the column names.

Note also that Excel and Notepad++ use first-order (1-based) indexing while Python uses zeroth-order (0-based) indexing.

If we modify the csv file to have column names and reimport, we can see that these are instead used.

We can change the way that python imports the file by using the keyword input arguments.

If we return to the original file which does not possess headings, we can override the default behaviour of using the top line as the column names by changing the value of header from 'infer' to None. We can also use the keyword input argument names to specify our own column names as a list of strings.

Removing the headings from the csv file we can import the file as a dataframe by selecting header=None and assigning names to a list of strings.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c0','c1']
csv_file=pd.read_csv('Book1.csv',
                     header=None,
                     names=cnames,
                     index_col=None)

We may use the keyword input argument skiprows for a file with non-data comment lines. For example, in the file below the first three lines are non-data comment lines and there is no header.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c0','c1']
csv_file=pd.read_csv('Book1.csv',
                     skiprows=3,
                     header=None,
                     names=cnames,
                     index_col=None)

It is also possible to read in only a selection of rows using the keyword input argument nrows, which can sometimes save time when previewing a file with a large number of rows. Columns may also be selected by using the keyword input argument usecols and providing a list of the original column positions. We can use the following if we only want to import the data from the first 5 data rows in column 1 of the csv file.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c1']
csv_file=pd.read_csv('Book1.csv',
                     skiprows=3,
                     header=None,
                     names=cnames,
                     index_col=None,
                     nrows=5,
                     usecols=[1])

Supposing we have the following csv file. Here the 0th column of the raw data contains the index names. We do not have the column names specified in the file and instead we want to create them when reading the file and assigning it to a dataframe. We can therefore use:

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c0','c1']
csv_file=pd.read_csv('Book1.csv',
                     header=None,
                     names=cnames,
                     index_col=None)

Supposing we have the following file and we use the following code:

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
# cnames=['c0','c1']
csv_file=pd.read_csv('Book1.csv',
                     header=None)

We may expect the file to have a list of index names and 2 columns. If we used the keyword input argument names and specified only 2 strings in the list cnames we would have got an error. This is because the list of strings is taken as column names and not the index names.

We can use the keyword input argument index_col to assign the index to one of the columns (using zero order indexing). In this case column 0.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c0','c1']
csv_file=pd.read_csv('Book1.csv',
                     header=None,
                     names=cnames,
                     index_col=0)

Reading a TXT File

We can use the function read_csv to read both csv and txt files; however, as the name suggests, it is configured for csv files by default.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
txt_file=pd.read_csv('Book2.txt',
                     header=None)

Here the delimiter is taken to be a comma , when it is actually a tab in a txt file. As a result there is no instruction to move onto the next column and all the data displays in a single column.

The function read_csv has the keyword input argument delimiter which can be used to override the default delimiter; to specify a tab we use '\t'. There is also an alias for delimiter, the separator sep.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c0','c1']
txt_file=pd.read_csv('Book2.txt',
                     header=None,
                     names=cnames,
                     delimiter='\t')

The file is now read as expected.

Reading an Excel XLSX File

An Excel XLSX file and CSV file are very similar when it comes to interaction with an end user.

One of the key differences is the fact that an XLSX file can have multiple sheets.

The function read_excel is similar to read_csv and shares most of the same positional and keyword input arguments. An additional keyword input argument sheet_name is available to specify the sheet. By default there is only a single sheet in an Excel file, called Sheet1. In Python this has the index 0 as zero-order indexing is used.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c0','c1']
excel_file=pd.read_excel('Book3.xlsx',
                         header=None,
                         names=cnames)

The sheet can also be manually specified by using the keyword input argument sheet_name and providing the string of the sheet name:

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
cnames=['c0','c1']
excel_file=pd.read_excel('Book3.xlsx',
                         sheet_name='Sheet1',
                         header=None,
                         names=cnames)

Handling Missing Data

Let's now look at a spreadsheet which has some cells that contain an error (for example a division by zero) or have missing data. If we import this as a dataframe these will be replaced by nan, denoting not a number.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
excel_file=pd.read_excel('Book4.xlsx',
                         sheet_name='Sheet1',
                         header=None,
                         names=['c0','c1'],
                         index_col=None)

We can use the pandas functions isna() or notna() to check if a value is or isn't a nan.

pd.isna(excel_file)
pd.notna(excel_file)

This returns a Boolean output corresponding to each value of the input.

We can also use the method sum to count the number of nan values and non-nan values.

pd.isna(excel_file).sum()
pd.notna(excel_file).sum()

The method dropna can also be used on a dataframe. We can type it with open parenthesis to see the keyword input arguments:

excel_file.dropna(

We can see that by default axis=0, how='any' and inplace=False. This means it will act on indexes (rows), removing any index (row) that has any nan value, and the result will be returned as a new dataframe.

excel_file.dropna(how='any')

On the other hand, it can be changed to how='all' which will only remove indexes (rows) where all values are missing.

excel_file.dropna(how='all')

It is also possible to specify a subset of pandas series (columns) from which to remove rows that have 'any' or 'all' nan values. For instance, if we were only interested in removing indexes that had a nan value in the pandas series 'c1' we would assign the keyword input argument subset to a single element list.

excel_file.dropna(how='any',subset=['c1'])

The method fillna can be used to fill any missing values. We can once again input it with open parenthesis to see the keyword input arguments.

We can fill in the data with a number, say for example the number 0.

excel_file.fillna(0)

Alternatively, we can use a dictionary to fill each column with a specific value, for instance replacing every missing value in the pandas series 'c0' with 0 and the pandas series 'c1' with 99.

fill_dict={'c0':0,'c1':99}
excel_file.fillna(value=fill_dict)

This may be particularly useful if each pandas series has a different dtype. For example, if you have a dataframe with a numeric series, a Boolean series and a series consisting of strings, it makes more sense to choose a default value for the nan values to suit each series.

The keyword input argument method can be set to forward fill 'ffill' or backfill 'bfill' where each missing value will take the previous non-missing value or next non-missing value respectively.

excel_file.fillna(method='ffill')

Forward fill requires a previous number. If we replace the value on the zeroth index of series 'c1' with None we will see it is unable to replace the nan values on the first two indexes of 'c1'.

excel_file.at[0,'c1']=None
excel_file.fillna(method='ffill')
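
Backfill works in the opposite direction, taking the next non-missing value, and would fill those leading gaps instead (a sketch):

excel_file.fillna(method='bfill')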

The pandas dataframe also has the method interpolate which will only work on numeric values and uses the scipy interpolation routines under the hood for many of its methods. We can type it in with open parenthesis to view the keyword input arguments:

We can try the default linear interpolation where all keyword input arguments are left at their default values:

excel_file.interpolate()

And in the case of our very simple dataset the values are interpolated correctly.

Writing a dataframe to File

For convenience, we will create a dataframe of randomly generated numbers. We will use the matrix of randomly generated integers as the data keyword input argument and the list of columns as the columns input argument of the pandas class DataFrame. For reproducibility the random seed will be set to 0 using the function seed with an input argument of 0.

# Prerequisites
import numpy as np
import pandas as pd
from scipy import random
# Create Random Data
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
# Create DataFrame
cnames=['c0','c 1','c_2','C 4','5c']
df=pd.DataFrame(data=matrix,columns=cnames)

It is possible to save the dataframe to an excel, csv or txt file by using the dataframe methods to_excel and to_csv. If we type in the name of the dataframe df followed by a dot . and then to_excel with open parenthesis we can see details about the positional and keyword input arguments.

For a simple file with a single sheet the positional input argument excel_writer can be the file name of the XLSX file including the extension. The keyword input argument sheet_name has a default value of the string 'Sheet1' and can be changed if desired.

df.to_excel('Book4.xlsx')

Note the index and the columns display in bold with borders. These can be removed if desired by using the keyword input arguments index and header and assigning them to False.

df.to_excel('Book4.xlsx',index=False,header=False)

In this case I got a permission error because the file was still open in Excel:

Closing the file in Excel and retrying the line of code works:

It is also possible to use the keyword input arguments startrow and startcol to offset the data, leaving blank rows and columns.

df.to_excel('Book4.xlsx',startrow=3,startcol=2)

Writing a workbook with multiple sheets is slightly more complicated. Using the procedure above with different sheet names will simply overwrite the file each time as opposed to appending a sheet.

We must create an instance of the ExcelWriter class (like the DataFrame class this is CamelCaseCapitalized). This class has a keyword input argument path which is the excel file name to be created. This also acts as a positional input argument. All other keyword input arguments can be left as their default values.

Let's update the code to create 3 dataframes df, df2 and df3 of random data. Instead of specifying the file name in each to_excel call, we use the writer instead. Finally the writer is an object and we must use its method save to save the file.

# Prerequisites
import numpy as np
import pandas as pd
from scipy import random
# Create Random Data
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
matrix2=random.randint(low=0,high=11,size=(12,5))
matrix3=random.randint(low=0,high=11,size=(12,5))
# Create DataFrames
cnames=['c0','c 1','c_2','C 4','5c']
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix2,columns=cnames)
df3=pd.DataFrame(data=matrix3,columns=cnames)
# Create Writer Class
writer=pd.ExcelWriter(path='Book5.xlsx')
# Assign Each Sheet
df.to_excel(writer,sheet_name='Sheet1')
df2.to_excel(writer,sheet_name='Sheet2')
df3.to_excel(writer,sheet_name='Sheet3')
# Save to File
writer.save()

The to_csv dataframe method can be used to write to a csv or txt file and has a lot more keyword input arguments related to formatting. These keyword input arguments default to those of a standard csv file. It also shares many of the keyword input arguments previously used in to_excel, although many of these don't show in the dialog as the formatting options are more commonly used.

Using the defaults we get a csv file:

df.to_csv('Book6.csv')

Using a txt extension will give us a comma delimited file with a txt extension instead of a tab delimited text file:

df.to_csv('Book7.txt')

We can specify the tab as a delimiter using the keyword input argument sep (unlike read_csv, there is no alias delimiter).

df.to_csv('Book7.txt',sep='\t')

Attributes and Methods

The dataframe is an object and possesses attributes and methods. These can be accessed by typing in the dataframe's name followed by a dot . and then a tab. The list can then be scrolled through. We have already used the to_csv and to_excel methods.

When beginning with pandas and indeed python programming there can be a lot of confusion between attributes and methods.

An attribute can be thought of as a property of the dataframe for example the size and shape of the dataframe.

df.size
df.shape

These are called without parenthesis.

The attribute however is also an object and may have attributes of its own (nested attribute). For example:

df.size.size

df.size gives the integer 60 (i.e. there are 60 elements in the dataframe df) and the integer 60 has a size of 1 (i.e. it is a scalar and thus has a single element).

We can see that an attribute is an object when we assign it to a variable name:

a=df.size

The index, columns, values properties of the dataframe are also attributes:

df.values
df.columns
df.index

An attribute can also be thought of as an object that is referenced via another object. A method on the other hand can be thought of as a function that is referenced via another object. Because it is a function it must have parenthesis.

The method head might initially be confused for an attribute. When typed into the console without parenthesis, the output informs the user that it is a method.

Therefore it should be called with parenthesis.

df.head()

This gives the first 5 indexes (0 up to but not including 5).

This may still look like just another property of the dataframe df. Examining it in more detail we can see it is a method and we can see details about the keyword input argument n.

Essentially we use this method, which acts as a function on the dataframe, with the specified keyword input argument n to read off the first n indexes. i.e. the method works on the original dataframe to create a new dataframe, as opposed to just reading off a property of the original dataframe.

We can use n as either a keyword or positional input argument in this case and override the default value of 5:

df.head(n=3)
df.head(2)

The method tail is similar and looks at the number of n indexes from the bottom of the dataframe. Once again n has the default value of 5.
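
For example (a sketch using the dataframe df from above):

df.tail()      # last 5 indexes (rows)
df.tail(n=3)   # last 3 indexes (rows)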

We can use the attribute dtypes to return the dtype of each column in the dataframe:

df.dtypes

If this attribute is called with open parenthesis, no details about input arguments are given which hints it is an attribute and not a method:

In this case they are all int32 as expected.

It is possible to get descriptive statistics on each series using the describe method.

df.describe()

When called with open parenthesis, details about the keyword input arguments display.

In this case they will all be left at their default values.

Column Selection

Square Bracket Indexing

Recall that when we index into a list, we type in the list name followed by the numerical index we want to index which is enclosed in square brackets.

lt=[10,20,30,40,50]
print(lt[1])

Likewise, for a dictionary we use the dictionary name followed by the key as a string which is also enclosed in square brackets.

di={'a':10,
    'b':20,
    'c':30,
    'd':40,
    'e':50}
di['b']

We can also follow this convention in a pandas dataframe by use of the pandas column name.

Like a key in the dictionary these names are input as strings and are case sensitive. Let's have a look at the pandas column 'c0':

df['c0']

When we assign this to a variable name, we see that it shows up as a pandas series within the variable explorer.

col0=df['c0']

Attribute Dot Indexing

If the column name follows the same rules as variable names, it is automatically added as an attribute to the dataframe. Recall that the attributes can be viewed by typing in the dataframe name followed by a dot . and then a tab .

In this case only the columns 'c0' and 'c_2' follow the rules. These can be selected as an attribute of the dataframe:

df.c0
df.c_2

As attributes are not strings there are no quotations used when using attribute . indexing.

Columns named without following the rules behind variable names will not have an attribute assigned to them when they are created. If the dataframe name is typed in followed by a dot . and then a tab then only c0 and c_2 display as attributes as all the other series are incorrectly named.

These columns can only be accessed using square brackets:

df['c 1']

Adding a New Column

One other nuance to keep in mind when comparing column selection using square brackets and attribute dot notation is that attribute names can only be used on columns that already exist. In other words, an attribute is created immediately after a column is created (if the column name follows the rules behind variable names). This means that attribute dot notation cannot be used to create a new column but can only be used to access an existing column once it is created.

We can see this if we try and generate a new column in the dataframe:

df['c6']=random.randint(low=0,high=11,size=12)

If on the other hand, we try:

df.c7=random.randint(low=0,high=11,size=12)

We get the warning UserWarning: Pandas doesn't allow columns to be created via a new attribute name.

Relabelling Indexes using a Dictionary

We can rename individual column names or row names by use of the rename method. Its columns (or index) keyword input argument expects a dictionary.

Recall that a dictionary has the following form. It is enclosed in braces {}, each key is paired with a corresponding value by use of a colon : and finally the comma , is used as a delimiter to move onto the next key value pair.

d1={'key0':'value0',
    'key1':'value1',
    'key2':'value2'}

This basic dictionary can be written on a single line:

d1={'key0':'value0','key1':'value1','key2':'value2'}

When using a dictionary to rename columns the original name is specified as the key and the new value is specified as the value.

df.rename(columns = {'oldcol0': 'newcol0', 'oldcol1':'newcol1'})

Supposing we wish to rename 'c 1' to 'c1', 'c_2' to 'c2' and 'C 4' to 'c4' we can use:

df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'})

Notice however that a new dataframe is printed to the console and the original dataframe remains unchanged. This can be addressed by manually reassigning the dataframe created to the original dataframe name:

df=df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'})

We can type in this method with open parenthesis to view the keyword input arguments. We can see that the inplace keyword input argument has a default value of False.

We can override this with the value of True.

df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},inplace=True)

Notice that in doing so the dataframe is updated but there is nothing printed to the console.

This means that if we assign the output to df, df will have the value NoneType.

df=df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},inplace=True)

In other words the following lines of code will update the dataframe as intended:

df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},inplace=True)
df=df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'})

However the following lines of code will not:

df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'})
df=df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},inplace=True)

It is possible to rename an index using the same method with the keyword input argument index:

df.rename(index={0:'r0',1:'r1',2:'r2'},inplace=True)

Reordering Columns

Supposing column '5c' was actually meant to be 'c3', we can reorder the column names by creating a list of the new order of the series names and indexing into the dataframe with it. We can assign the output to a new dataframe or alternatively, if we wanted, reassign it to the original dataframe:

neworder=['c0','c1','c2','5c','c4','c6']
df2=df[neworder]

This method can also be used to create a smaller dataframe with only a subset of the columns.
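
For example, a sketch creating a smaller dataframe from two of the columns (the name df_subset is just for this example):

df_subset=df[['c0','c2']]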

Relabelling Columns or Indexes using a For Loop

Now supposing we want to rename all the columns. We can use the dictionary method as earlier. However, for convenience we may want to use a for loop to create a list of string names.

Let's have a look at the number of columns in the dataframe, which can be accessed by unpacking the attribute df2.shape into a tuple.

(nrows,ncols)=df2.shape

We can then use a for loop within a list comprehension with ncols to generate a list of numbers:

[i for i in range(ncols)]

Which creates a list of numbers. Let's assign it to a variable cnames so we can view it in the variable explorer:

cnames=[i for i in range(ncols)]

We can see that everything is numeric; however we want strings, so we can convert i to the str class.

cnames=[str(i) for i in range(ncols)]

Then we can concatenate a prefix to the str to get a list of column names:

cnames=['C'+str(i) for i in range(ncols)]

We can then reassign the attribute df2.columns to this new list.

df2.columns=cnames

The new attributes can be accessed using a dot . and then a tab .

If we wanted to also relabel the rows using a for loop we can add the following 2 lines:

rnames=['r'+str(i) for i in range(nrows)]
df2.index=rnames

The Series Class

If we take a look at the dataframe df2 we see that it displays on the variable explorer and has the type DataFrame.

If we select a column and assign it to a variable name:

col1=df2.c1

We can see that the column has a type of Series and not DataFrame; note the capitalization of both as it refers to their class. col1 (the variable name) is an instance of the Series class and displays as a single column. The column name is 'c1'. The series possesses an index and a column vector of data corresponding to the index.

A series can be created manually by creating an instance of the Series class. The keyword input argument data can be input in the form of a dictionary, in which case each key is an index name and each value is the corresponding value. Alternatively the data may be input in the form of a list and the index may be input separately as a list using the keyword input argument index. The keyword input argument name can be used to specify a name for the series and the keyword input argument dtype can be used to set the datatype of the series.

Let's look at creating a series using both a dictionary and lists

# Prerequisites
import numpy as np
import pandas as pd
from scipy import random
# Create a Series
data0={'r0':2,'r1':4,'r2':6}
data1=[4,8,12]
index1=['r0','r1','r2']
col0=pd.Series(data=data0,name='c0')
col1=pd.Series(data=data1,index=index1,name='c1')

It is much rarer to see a dataframe with a single column; however it is sometimes used as the first step to create a dataframe from a series. For example:

df=pd.DataFrame(col0)

Each additional series can then be added as a column for example:

df['c1']=col1

The DataFrame and Series classes have a different but similar list of attributes and methods available to them.

An analogy to core Python is a str and an int. These can be created shorthand using:

a=1
b='a'
c=1.5

Or longhand using the class name:

a=int(1)
b=str('a')
c=float(1.5)

The list of attributes and methods can be accessed by typing in the object name followed by a dot . and then tab. In this case we can see the attributes and methods of the int object a are numeric and those of the str object b are string related.

c is a float, which is also numeric but more complicated than an int. As a result the list of attributes and methods available to the float is similar to the int but a bit more complicated:

Returning to col0 an instance of the Series class, we can have a look at the attributes and methods by typing in the object name followed by a dot . and then tab . The row names are attributes of the column for example:

A series attribute can be selected from a dataframe and then the list of attributes and methods accessed in the same way by use of a dot . and then tab :

Using:

df.c1.r0

Returns the value of the cell, 4. c1 is a series attribute of the dataframe df corresponding to the column c1. r0 is an int attribute of the series c1 corresponding to the row r0:

The Index Class

Let's return to the code above and modify the text of cnames to capitalize the C.

# Prerequisites
import numpy as np
import pandas as pd
from scipy import random
# Create Random Data
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
# Create DataFrames
cnames=['C'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,
                 index=rnames)

We can look at interacting with the columns attribute of df2.

df2.columns

When input into the console, it displays the type Index.

We can see if we assign this a variable name:

cnames2=df2.columns

That it has the class Index. There may be some confusion here as this is a list of column names and not of index (row) names. The class Index can be used for both index (row) names and column names but is far more commonly used and referenced for index (row) names, in line with the name of the class.

This is similar to a list; however it is a different class and therefore has differing attributes and methods. We can access these by typing in the object name followed by a dot . and then tab.

We can have a look at creating an instance of the Index class manually. Examining the input arguments we can see analogies to the Series class, where an index and data could be input as key value pairs in a dictionary or as two separate lists. The Index class only has an index.

We can create a simple index from a list:

idx=pd.Index(['a','b','c'])

Note the Index class displays in the variable explorer as a single column with a number of rows, as the Index class is most commonly used for row names as previously discussed.

Returning to the index cnames2 we can select the attribute str to access string specific methods available. Once again we can access these using a dot . and tab :

In this case we can use the method lower to remove the capitalization. If the output is not reassigned to the object name, no changes are made and cnames2 remains capitalized, so we reassign it:

cnames2=cnames2.str.lower()

To update the column names in the dataframe to lower case we can therefore use:

df2.columns=df2.columns.str.lower()

Sometimes (particularly when importing data from Excel) we may wish to keep the original names but remove any spaces so we get attributes for attribute indexing. We can modify the code to emulate a dataframe with names that include a space.

# Prerequisites
import numpy as np
import pandas as pd
from scipy import random
# Create Random Data
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
# Create DataFrame
cnames=['c0','c 1','c_2','C 4','5c']
df=pd.DataFrame(data=matrix,columns=cnames)
df['c6']=random.randint(low=0,high=11,size=12)
df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},inplace=True)
neworder=['c0','c1','c2','5c','c4','c6']
df2=df[neworder]
(nrows,ncols)=df2.shape
cnames=['c '+str(i) for i in range(ncols)]
df2.columns=cnames

Instead of using the str method lower, we can use the str method replace. If we call this using open parenthesis we see 2 keyword input arguments: pat, the pattern to be replaced, and repl, the replacement for the pattern.

We can select the str attribute and then use the method replace to replace a repeating pattern, in this case a space, with nothing.

df2.columns=df2.columns.str.replace(pat=' ',repl='')

Alternatively if an underscore is required we can use:

df2.columns=df2.columns.str.replace(pat='c',repl='c_')

Adding a Row

A new row can be appended to the bottom of a dataframe using the method append. In this case the first input argument to append, the new row, must be in the form of a dictionary or a pandas series.

When using a dictionary, each of its keys should correspond to the names of the columns in the dataframe and the values should be the values to be appended.

new_row={'c_0':9,'c_1':9,'c_2':9,'c_3':9,'c_4':9,'c_5':9}

When a dictionary is used, the keyword input argument ignore_index should be set to True. With this setting a new integer index value will be created and the row will be appended to the bottom of the dataframe.

new_row={'c_0':9,'c_1':9,'c_2':9,'c_3':9,'c_4':9,'c_5':9}
df2.append(new_row,ignore_index=True)

Note this method doesn't update the dataframe but prints the output to the console. Unlike the method rename, this method does not have the keyword input argument inplace and therefore one must manually reassign the output to the original dataframe for an inplace update.

df2=df2.append(new_row,ignore_index=True)

Alternatively one can create an instance of the Series class using the dictionary:

The series can be created by assigning the dictionary new_row to the data positional input argument and the keyword input argument name to 12 (which will be the value of the index displayed in the dataframe).

new_row={'c_0':8,'c_1':8,'c_2':8,'c_3':8,'c_4':8,'c_5':8}
new_series=pd.Series(data=new_row,name=12)

Note the Series class can be a column or row but is most commonly used for columns. When opened in the variable explorer it therefore displays as a column as opposed to a row:

It will however append as a row. Using a series for appending as opposed to a dictionary is usually only done if a string is desired for the index name:

df2=df2.append(new_series)

Note if the keyword input argument name is not provided when creating the series, the series will act in the same manner as the dictionary and the keyword input argument ignore_index must be set to True when using the append method.
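
A minimal sketch of that case (new_series2 is a hypothetical name):

new_series2=pd.Series(data=new_row)            # no name provided
df2=df2.append(new_series2,ignore_index=True)  # so ignore_index is required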

A series can be either a column or a row. To compare, we can assign the column 'c_0' to a variable name, which also gives a series:

col0=df2['c_0']

Deleting a Column or Row

We can delete a column by using the method drop.

This method can be used with the labels keyword input argument and the axis keyword input argument, where labels is the name of the row or column and axis works similarly to many of the numpy functions: axis=0 acts on indexes (rows) and axis=1 acts on series (columns). If used this way, labels and axis need to match. Alternatively it is simpler to use the index and columns keyword input arguments, as they are a bit more intuitive, as shown in the code below.
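
For comparison, a sketch of the labels and axis form:

df2.drop(labels=10,axis=0)              # drop the index (row) labelled 10
df2.drop(labels=['c_0','c_2'],axis=1)   # drop the series (columns) 'c_0' and 'c_2'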

The output is printed to the console. This method also has an inplace keyword argument which is set to False by default similar to the dataframe method rename which we used earlier.

df2.drop(index=10,columns=['c_0','c_2'],inplace=True)

In my case however this gives a SettingWithCopyWarning suggesting this is not properly implemented.

Reassigning the dataframe works without any issues:

df2=df2.drop(index=10,columns=['c_0','c_2'])

Indexing

Let's simplify the code above:

# Prerequisites
import numpy as np
import pandas as pd
from scipy import random
# Create Random Data
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
# Create DataFrames
cnames=['c0','c1','c2','c3','c4']
df=pd.DataFrame(data=matrix,columns=cnames)

Here we have numeric indexes.

Indexing is more commonly carried out by column, using square brackets or attribute indexing:

df['c0']
df.c0

A value can be selected by further indexing into the column. Because the indexes are numeric there are no attributes created for the index names, so only square bracket indexing will work for the row. For example:

df['c0'][1]
df.c0[1]

We can also create a dataframe df2 with string indexes.

# Prerequisites
import numpy as np
import pandas as pd
from scipy import random
# Create Random Data
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
# Create DataFrames
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,
                 index=rnames)

Now we can use both square bracket and attribute indexing of a column to get a value in a cell:

df2['c0']['r1']
df2.c0['r1']
df2['c0'].r1
df2.c0.r1

It is also possible to index by row first using the attributes loc and iloc.

When the attribute alone is selected:

df.iloc

The value returned informs the user that it is used for indexing:

iloc uses integer indexing and an integer must be provided in square brackets.

df.iloc[1]

The value returned is a pandas series, which is a vector with a single dimension. It isn't assigned to a new variable so it is displayed alongside the column names in the console.

Series display as columns by default.

This can be seen if it is assigned to a variable name:

row1=df.iloc[1]

loc uses the index name, which can be a string or a number. In the example above the indexes are numeric so both iloc and loc appear to behave identically.

row1=df.loc[1]

The difference can be seen more clearly when they are used with df2:

df2.iloc[3]
df2.loc['r3']

Note that iloc can be used when index names are applied and will correspond to the numeric order of the indexes. It can be used to select multiple rows using slicing which follows the same rules as slicing in a list or tuple:

df.iloc[3:5]

Recall that Python uses zero-order indexing, so the slice 3:5 starts from 3 and goes up to 5 in steps of 1 but never reaches 5 itself. In other words the slice is inclusive of the lower bound and exclusive of the upper bound. When a slice containing multiple indexes is selected, the output will display as a dataframe (matrix) as opposed to a series (vector):

Because df2.iloc[3:5] is essentially a dataframe, it can be indexed as normal:

df2.iloc[3:5].c1
df2.iloc[3:5]['c1']

A single row becomes a series and can also be indexed into using the column name:

df2.iloc[3].c1
df2.iloc[3]['c1']

When indexing a cell we have so far selected either a column attribute and then a row, or a row attribute and then a column. This means we are selecting an attribute of an attribute, i.e. a selection of a selection, and are not selecting the cell directly.

We can use the attribute iloc, loc, iat or at to index a cell directly by specifying the index and column within a single square bracket.

df2.loc['r2','c1']
df2.iloc[2,1]
df2.at['r2','c1']

This will also work when the index is a numeric integer:

df.at[2,'c1']

at and iat may at first glance appear to be identical to loc and iloc respectively. Note however that at (or integer at, iat) always requires both the index and column to be specified within the square brackets. loc and iloc are therefore more routinely used to select an index while at and iat are more routinely used to select a cell. Using at and iat is also preferred for updating the value of a cell.

df2.at['r2','c1']=12

Be careful of the dtype in a dataframe when updating the data:

df2.iat[2,1]=3.1

In the example above the float 3.1 was used however the dtypes of the columns are int32 and as a result only the integer part of the float was accepted. This can be seen by using the pandas series attribute dtype to check the datatype of the column:

df2.c1.dtype

The pandas series method astype can be used to change the datatype to a float:

df2.c1.astype(float)

Note each value now has a decimal point. The output is printed to the console and the dataframe is not updated.

To update we must reassign the output to the column:

df2.c1=df2.c1.astype(float)

At first glance there appears to be no change to the dataframe:

However when we attempt to add the float again it works as expected. We can also use the dataframe attribute dtypes and see that c1 now has a float datatype:
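
df2.iat[2,1]=3.1   # the float value is now stored without truncation
df2.dtypes         # c1 now displays as a float dtype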

Sorting Data

We can sort a dataframe or a pandas series by using the method sort_values. Let's use the dataframe attribute c1 to get the series.

df.c1

Now we can type the method sort_values with open parenthesis to see the input arguments. The keyword input arguments of most interest are axis, ascending and inplace.

The axis input argument works in the same way as in many of the numpy array methods. Because the method is being called from the series df.c1 there is only a single axis 0, so this can be left at its default.

ascending is set to True meaning the series will be sorted from lower to higher values.

inplace is set to False meaning the original column won't be replaced and the output will instead be displayed in the console.

Therefore leaving these three keyword input arguments as their default values gives:
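
df.c1.sort_values()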

The attribute loc can be used instead to select an index to sort:

df2.loc['r1'].sort_values()

If we attempt to set the keyword inplace to True we get an error ValueError: This Series is a view of some other array, to sort in-place you must create a copy. This is because we are looking at an attribute of the dataframe and not the dataframe itself.

df2.c1.sort_values(inplace=True)

When using sort_values directly from a dataframe, an additional positional input argument by displays.

by can be the column name in the form of a string if data is sorted only by a single column:
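
df2.sort_values(by='c1')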

Note that 'c1' contains two 0 values, three 3 values and three 6 values. We can provide secondary and tertiary sorting of the duplicates by providing a list with additional columns.

df2.sort_values(by=['c1','c2','c3'],inplace=True)

To instead sort the columns using an index (row), we need to change the keyword input argument axis to 1 and can then use the keyword input argument by to select an index.

df2.sort_values(by='r1',axis=1,inplace=True)

When indexes are not assigned strings, by will work with the numeric indexes.

df.sort_values(by=[0,1],axis=1,inplace=True)

Sorting by reverse order:

df.sort_values(by=[0,1],axis=1,ascending=False,inplace=True)

To return to the default order, i.e. sorted by the column name or index name, there is the additional method sort_index, which once again has the keyword input argument axis with a default of 0, meaning it will sort rows by their index in line with the method's name. This method does not have the keyword input argument by as it will always use the index names when axis is set to 0 or the column names when axis is set to 1.

In this case we can sort the columns out by the column names:

df.sort_index(axis=1,inplace=True)
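
Likewise the rows of df2, which were sorted by values earlier, can be returned to index order using the default axis=0 (a sketch):

df2.sort_index(inplace=True)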

Filtering

Boolean Values

Filtering is typically carried out by use of Booleans (True or False) which instruct pandas to show or hide selected rows. Let's create a random Boolean array which has the same length as a series in the dataframe df2.

show_index=random.choice([True,False],size=(12))

Now we can open up the dataframe df2 and the Boolean list show_index in the variable explorer side by side.

We can use the list show_index to index into the dataframe df2 by use of the Boolean.

df2[show_index]

Indexing into a dataframe by the Boolean list will create a new dataframe that contains only the indexes which were True in the Boolean list.

1,3,4 and 6 are True in the show_index list and as a result only these indexes are selected.

This new dataframe is not assigned to an output variable name and only previewed in the console. We can reassign the dataframe variable name to update the dataframe or create a new dataframe in this case df3:

df3=df2[show_index]

To instead filter columns we need to create a list of Booleans which is the same size as the number of columns.

show_index=random.choice([True,False],size=(5))

Then we need to use the attribute iloc with both a row and column index. For the row index we will use : denoting all rows and for the column index we will use the list of Boolean values.

df2.iloc[:,show_index]

Conditional Logic

In the examples above a list of booleans was randomly created to show the principle behind filtering. In most applications a question is asked to generate such a list. For instance we may only be interested in rows which have a value in c1 that is greater than 5. We can select this column using attribute notation and then ask if it is greater than 5 to get a comparable boolean list.

df2.c1>5

We can then assign it to the variable bool_list and as before index using it:

bool_list=df2.c1>5
df2[bool_list]

Note that bool_list is a pandas series and not a Python list. A pandas series by default acts as a column:

Combining these two in one line we can use:

df2[df2.c1>5]

We can use this notation to look at rows which have a value of 6 in the column c1 and assign it to a new dataframe df3. Recall the difference between the == equals to operator and = the assignment operator.

df3=df2[df2.c1==6]

If instead we wanted to select columns which only had a value of 3 in r0, the procedure is slightly more complicated. We can use:

show_index=df2.loc['r0']==3

However notice that the pandas series created displays as a column as opposed to a row:

This has a consequence: indexing with it gives us a dimensionality error as it is used as a column and not a row.

We can overcome this by using the method to_list. Using this method without assignment to a variable will print the list to the console. Alternatively we can reassign the variable name show_index using:

show_index=show_index.to_list()

Recall that a list has a single dimension and will be re-orientated to be either a row or a column depending on what is most suitable, i.e. it displays in the variable explorer as a row but when expanded displays as a column. This means the line:

df2.iloc[:,show_index]

Will now work as the dimensionality is correct:

The method to_numpy works in a similar manner to the method to_list but creates a numpy array instead of a list.

show_index=show_index.to_numpy()

Multiple logical statements can be combined in this way. Each condition has to be enclosed in parentheses and the symbols & (and) and | (or) can be used to combine the conditions.

df2[(df2.c1==3) & (df2.c2==3)]

Instead of filtering we can use conditional logic to ask a question and assign the Boolean output to a new column q.

df2['q']=(df2.c1==3) & (df2.c2==3)

We can look at the datatypes of the dataframe using the attribute dtypes.