Python Data Analysis Library (pandas)

Introduction

This is an introductory guide to using the Python Data Analysis (pandas) library to organize and manipulate data in a dataframe (which can be thought of as equivalent to an Excel spreadsheet). The Spyder 4 IDE will be used in this example as it has a variable explorer which makes it easier to visualize the example data.

Importing the Pandas Library

The pandas library is one of the most commonly used python libraries, particularly in the field of data science. As a consequence it is commonly imported using a two letter abbreviation, similar to the numpy library.

import numpy as np
import pandas as pd

DataFrames

Vectors

Suppose we start out with the following data as numpy array vectors.

import numpy as np
x=np.array([0,1,2,3,4])
y1=np.array([0,2,4,6,8])
y2=np.array([0,1.5,3,4.5,6.0])

Matrix

In this case x is the independent axis and y1 and y2 are measurements taken at the values of x. It may be more convenient to group them together as a numpy array matrix. To do this we can explicitly reshape each numpy array vector into a column and concatenate them:

import numpy as np
x=np.array([0,1,2,3,4])
y1=np.array([0,2,4,6,8])
y2=np.array([0,1.5,3,4.5,6.0])
x=np.reshape(x,(len(x),1))
y1=np.reshape(y1,(len(y1),1))
y2=np.reshape(y2,(len(y2),1))
matrix=np.concatenate([x,y1,y2],axis=1)

Note however that when we do so, the original names of each data series (column), in this case x, y1 and y2, are lost.

DataFrame

We can instead create another data structure which groups the data together like a matrix and assigns a name to each series (column); this is called a dataframe. An Excel spreadsheet is very similar to a dataframe.

import numpy as np
x=np.array([0,1,2,3,4])
y1=np.array([0,2,4,6,8])
y2=np.array([0,1.5,3,4.5,6.0])
x=np.reshape(x,(len(x),1))
y1=np.reshape(y1,(len(y1),1))
y2=np.reshape(y2,(len(y2),1))
matrix=np.concatenate([x,y1,y2],axis=1)
import pandas as pd
dataframe=pd.DataFrame(matrix,columns=['x','y1','y2'])

Note that on the left hand side of the last line we have assigned an instance to the name dataframe, whereas on the right hand side we are using the CamelCase capitalization DataFrame because we are calling a third-party library class.

Attributes

For convenience, we will create a dataframe of randomly generated numbers. We will use the matrix of randomly generated integers as the 0th input argument and the list of column names as the 1st input argument of the pandas class DataFrame. For reproducibility the random seed will be set to 0 using the function seed with an input argument of 0.

# Prerequisites
import numpy as np
import pandas as pd
# Create Random Data
np.random.seed(0)
matrix=np.random.randint(low=0,high=11,size=(12,5))
# Create DataFrame
cnames=['c0','c 1','c_2','C 4','5c']
df=pd.DataFrame(data=matrix,columns=cnames)

We can open this dataframe, df, up in the variable explorer. df contains five data series or pandas series, 'c0', 'c 1', 'c_2', 'C 4' and '5c', which we also commonly refer to as columns.

Each column is equally sized and has an index of length 12. By default, the index names are numeric integers using zero-based indexing.
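We can confirm this by looking up the index attribute (attributes are explained in more detail below):

df.index

For a freshly created 12 row dataframe this returns RangeIndex(start=0, stop=12, step=1).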

A dataframe is an object which contains several attributes. An attribute can be thought of as a variable that is looked up from another object using the dot . syntax. Attributes can be accessed by typing in the name of the dataframe followed by a dot . and then pressing tab.

For example, we may access the attribute columns using:

df.columns

Let's now access the attribute columns and save the output to a variable name:

columns2=df.columns

Here we see that the new variable 'columns2' is listed in the variable explorer and the variable explorer shows that the type is Index.

We can also lookup the attribute shape which gives the dimensions of the dataframe in the form of a tuple with 2 values representing the number of indexes (rows) and the number of series (columns) respectively.

df.shape

In addition, we can use the attribute dtypes to get the type of data present in each pandas series (column).

df.dtypes

In this case they are all int32 as expected.

Methods

Notice that the above attributes are called without use of parentheses. Attributes should not be confused with methods, which are also called using the dot . notation but require parentheses, usually to take input arguments. To view the list of methods available for a dataframe we can once again type the dataframe name followed by a dot . and then press tab. The list generated however contains both attributes and methods, with no clear indicator between them.

This list of both attributes and methods applicable to a dataframe can be printed to the console by using the function dir and specifying the dataframe as its input argument.

dir(df)

The items beginning with the double underscore are special methods which are used when creating custom classes and can be ignored for just now. Again, looking through the list it is not obvious which name is an attribute and which is a method.
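If needed, a rough way to separate the two is to test each public name with the built-in function callable; this is a minimal sketch, and the variable names are arbitrary:

methods=[name for name in dir(df) if not name.startswith('_') and callable(getattr(df,name))]
attributes=[name for name in dir(df) if not name.startswith('_') and not callable(getattr(df,name))]

Names bound to functions end up in methods and the rest in attributes.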

If you type in a dataframe method with an open parenthesis the Spyder IDE will give you details about the positional and keyword input arguments. For example, if we type in:

df.head(

In contrast if we attempt this with an attribute:

df.columns(

Nothing will show as the attribute doesn't have input arguments.

Let's have a look at the dataframe method head. As we can see, this method has a keyword argument n which is set to a default value of 5. This method will create a new dataframe that previews the first 5 indexes (or n indexes if the keyword input argument is modified).

df.head()

Alternatively, the method tail can be used to look at the last 5 indexes (or n indexes if the keyword argument is modified).

df.tail()

These could be assigned to new variable names and displayed within the variable explorer if desired by using the assignment operator with a variable name on the left-hand side. Basic statistics about each dataframe series (column) can be obtained by use of the method describe. Once again if we type in this method with open parenthesis, we can see it has keyword input arguments.
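For example, returning to head and tail (the variable names first3 and last3 are arbitrary):

first3=df.head(n=3)
last3=df.tail(n=3)

Both now appear in the variable explorer as dataframes with 3 indexes (rows) each.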

We can call the method describe without modification of any keyword input arguments:

df.describe()

Nested Attributes and Methods

As mentioned earlier, attributes can be thought of as a variable that is looked up from another object using the dot . syntax. Attributes themselves can have nested attributes and methods. Once again we can access these using a dot . and pressing tab.

In this case we may use the method to_list to convert the columns attribute, which is of type Index, to a list.

columns3=df.columns.to_list()

Compare columns2 to columns3 in the variable explorer.

Indexing a pandas series (column) – Square Bracket Indexing vs Attribute Dot Indexing

Recall that when we index into a list, we type in the list name followed by the numerical index we want, enclosed in square brackets.

lt=[10,20,30,40,50]
print(lt[1])

Likewise, for a dictionary we use the dictionary name followed by the key as a string which is also enclosed in square brackets.

di={'a':10,
    'b':20,
    'c':30,
    'd':40,
    'e':50}
di['b']

We can also follow this convention in a pandas dataframe by use of the pandas series (column) name.

Like a key in the dictionary these names are input as strings and are case sensitive. Let's have a look at the pandas series 'c0':

df['c0']

When we assign this to a variable name, we see that it shows up as a pandas series within the variable explorer.

col0=df['c0']

Recall we could use the method to_list to convert it to a list.

If the series (column) name follows the same rules as variable names, it is automatically added as an attribute to the dataframe. Recall that the attributes can be viewed by typing in the dataframe name followed by a dot . and then pressing tab.

In this case only the series (columns) 'c0' and 'c_2' follow the rules. These can be selected as an attribute of the dataframe:

df.c0
df.c_2

Note that no quotations are used when using attribute dot . notation.

Attributes can also be indexed. For example, we can select index 5 of the pandas series (column) by use of square brackets:

df.c0[5]

Nested attributes and methods can be accessed for example:

df.c0.dtypes
df.c0.max()

Series named without following the rules behind variable names will not have an attribute and so cannot be indexed in this way, but they can be indexed with square brackets. For example 'c 1', which has a space.

df['c 1']

Creating a new pandas series (column)

One other nuance to keep in mind when comparing series selection using square brackets and attribute dot notation is that attribute names can only be used on series (columns) that already exist. In other words, when a series is created, an attribute is created if the series name follows the rules behind variable names. As a consequence attribute dot notation cannot be used to create a new series but can only be used to access an existing one. We can see this if we try to generate a new series (column) in the dataframe:

df['c6']=np.random.randint(low=0,high=11,size=12)

If on the other hand, we try:

df.c7=np.random.randint(low=0,high=11,size=12)

We get the warning UserWarning: Pandas doesn't allow columns to be created via a new attribute name.

Renaming a pandas series (column)

We can rename individual columns by use of the rename method. Its keyword input argument columns expects a dictionary.

Recall that a dictionary has the following form. It is enclosed in braces {} and each key is paired with a corresponding value by use of a colon : and finally the comma , is used as a delimiter to move onto the next key-value pair.

ex_dict={'key0':'value0',
         'key1':'value1',
         'key2':'value2'}

This basic dictionary can be written on a single line:

ex_dict={'key0':'value0','key1':'value1','key2':'value2'}

When using a dictionary to rename columns the original name is specified as the key and the new value is specified as the value.

df.rename(columns={'oldcol0':'newcol0','oldcol1':'newcol1'})

Supposing we wish to rename 'c 1' to 'c1', 'c_2' to 'c2' and 'C 4' to 'c4' we can use:

df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'})

Notice however that a new dataframe is printed to the console and the original dataframe remains unchanged. This can be addressed by manually reassigning the dataframe created to the original dataframe name:

df=df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'})

However this method has a keyword argument inplace which has a default value of False.

We can simply reassign it to True to perform an inplace update of our original dataframe.

df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},inplace=True)

Note that if the keyword input argument inplace=True, there is no output argument for this method and a NoneType object is returned. This is the same behaviour as assigning the output of a custom function which has no return statement.
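As a minimal sketch of this behaviour (the variable name result is arbitrary; since the rename above has already been applied, this call changes nothing):

result=df.rename(columns={'c 1':'c1'},inplace=True)
print(result)

The console prints None because inplace=True suppresses the output dataframe.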

Once these columns have been renamed, they will show up as attributes because they now follow the rules behind variable names.

Supposing the series (column) '5c' was actually meant to sit where 'c3' would be, we can reorder the series names by creating a list of the new order of the series names and indexing into the dataframe with it. We can assign the output to a new dataframe or reassign it to the original dataframe:

neworder=['c0','c1','c2','5c','c4','c6']
df2=df[neworder]

Now supposing we want to rename all the pandas series (columns). We could use the dictionary method as earlier. However, for convenience we may want to use a list comprehension to create a list of string names. We can then reassign the attribute df.columns to this new list.

# Prerequisites
import numpy as np
import pandas as pd
# Create Random Data
np.random.seed(0)
data=np.random.randint(low=0,high=11,size=(12,5))
# Create DataFrame
columns=['c0','c 1','c_2','C 4','5c']
df=pd.DataFrame(data=data,columns=columns)
df['c6']=np.random.randint(low=0,high=11,size=12)
df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},
          inplace=True)
neworder=['c0','c1','c2','5c','c4','c6']
df=df[neworder]
# Create column names
(nrows,ncols)=np.shape(df)
col=np.arange(ncols)
col=['c' + i for i in col.astype(str)]
df.columns=col

The list comprehension here carries out several steps. It uses the method astype to convert from integer values to strings, takes advantage of the fact that string concatenation combines strings, and is enclosed in square brackets to make a list. We can see what is going on in more detail below.
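Breaking the list comprehension above into stages (a sketch; the intermediate name col_str is arbitrary):

col=np.arange(ncols)
col_str=col.astype(str)
col=['c' + i for i in col_str]

First an integer array array([0, 1, 2, 3, 4, 5]) is created, astype(str) converts each integer to a string, and finally the comprehension concatenates 'c' with each string to build the list ['c0', 'c1', 'c2', 'c3', 'c4', 'c5'].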

In many cases we will come across data which contains a space or upper-case letter in its column name. Let's amend the code above to emulate this.

# Prerequisites
import numpy as np
import pandas as pd
# Create Random Data
np.random.seed(0)
data=np.random.randint(low=0,high=11,size=(12,5))
# Create DataFrame
columns=['c0','c 1','c_2','C 4','5c']
df=pd.DataFrame(data=data,columns=columns)
df['c6']=np.random.randint(low=0,high=11,size=12)
df.rename(columns={'c 1':'c1','c_2':'c2','C 4':'c4'},
          inplace=True)
neworder=['c0','c1','c2','5c','c4','c6']
df=df[neworder]
# Create column names
(nrows,ncols)=np.shape(df)
col=np.arange(ncols)
col=['C ' + i for i in col.astype(str)]
df.columns=col
df.columns.str.lower()

However, we will need to reassign the value of df.columns to this new value:

df.columns=df.columns.str.lower()

We can also select the str attribute and then use the method replace to replace a repeating pattern, in this case a space, with an underscore replacement.

df.columns=df.columns.str.replace(pat=' ',repl='_')

Deleting a pandas series (column) or index (row)

We can delete a column by using the method drop. This method has two additional keyword arguments: axis, similar to many of the numpy functions we have seen earlier, where axis=0 acts on indexes (rows) and axis=1 acts on series (columns); and inplace, which is set to False by default, similar to the dataframe method rename which we used earlier.

df.drop('c_0',axis=1,inplace=True)

Multiple series (columns) can be dropped by inputting a list:

df.drop(['c_3','c_5'],axis=1,inplace=True)

Indexes (rows) can be deleted by using the numerical value of the rows and setting axis=0.

df.drop([3,4],axis=0,inplace=True)

Adding an index (row)

An index (row) can be added using the method append. In this case the first input argument to append, the new index (row), must be a dictionary nested in a list. Each of its keys should correspond to the names of the series (columns) in the dataframe and the values should be the values to be added.

new_row=[{'c_1':5,'c_2':6,'c_4':7}]
df.append(new_row,ignore_index=True,sort=None)

Note this method doesn't update the dataframe but prints the output to the console. Unlike the methods drop and rename, this method does not have the keyword input argument inplace and therefore one must manually reassign the output to the original dataframe for an inplace update.

df=df.append(new_row,ignore_index=True,sort=None)

Sorting data

We can sort a pandas series (column) by using the method sort_values. We can use this method on a column by typing the dataframe name, then the column attribute, and then calling the method sort_values() without an input argument:

df.c_1.sort_values()

This can also be done by selecting the column using square bracket notation:

df['c_1'].sort_values()

The keyword input argument ascending can be reassigned from the default value of True to False to sort the data in descending order. The dataframe method sort_values also has the keyword input arguments axis and inplace which we have seen before. When axis=0 the method sort_values works on all the rows of the selected column. inplace should be set to True when replacing the original column within the dataframe and set to False when previewing the changes for example in the console.

df.c_1.sort_values(ascending=True,inplace=True)

Note that because we are sorting the values of an attribute of the dataframe, we get the following error if we attempt an inplace update: ValueError: This Series is a view of some other array, to sort in-place you must create a copy. To get around this we should instead use the method sort_values directly on the dataframe.

df.sort_values(['c_1'])
df.sort_values(['c_1'],inplace=True)

Note that although the dataframe df is sorted, the original indexes remain.

We can sort using multiple pandas series (columns) by passing in a list of the pandas series (column) names. In this case let's create a list called sort_order which contains the pandas series (column) names we wish to sort by. The pandas series (column) name at the 0th index of the list sort_order is used to sort the data in df, in this case 'c_2'. When two indexes in 'c_2' have the same value, the values in 'c_4' are then sorted in order.

sort_order=['c_2','c_4']
df.sort_values(sort_order,inplace=True)

This of course could be combined into one line:

df.sort_values(['c_2','c_4'],inplace=True)

If we restart the kernel and then rerun the script, we can try the following.

We can also select the data by an index and then reorder the data in the columns. For the dataframe df we must select the index by its numeric value. Once again axis=1 means the operation is carried out on the columns (of the rows).

df.sort_values(7,axis=1)

Filtering

Filtering is typically carried out by use of a Boolean. Let's create a Boolean list which has the same length as a series in the dataframe df.

show_index=[True,False,True,False,True,False,True,False,True,False,True]

For convenience let's restart the kernel and rerun the script so the indexes are in numerical order. Now we can open up the dataframe df and the Boolean list show_index in the variable explorer side by side. Now we can index into the dataframe df by use of the Boolean list.

df[show_index]

Indexing into a dataframe by the Boolean list will create a new dataframe that contains only the indexes which were True in the Boolean list.

This new dataframe is not assigned to an output variable name and only previewed in the console. We can save it to a new variable name, in this case df2:

df2=df[show_index]

In the above case we have made a filter selection by manually creating a Boolean list. Quite often this Boolean vector is created by evaluating a condition, for example on one of the pandas series (columns). For example, we can check each value of the series 'c_1' and assign True only if it is equal to the value 3.

condition=df.c_1==3

Note that condition shows up on the Variable explorer as a pandas series.

This can then be used to index into the dataframe df to filter by this condition.

df[condition]

This can be combined in one step by using:

df[df.c_1==3]

Multiple logical statements can be used to index in this way. Each condition has to be enclosed in parentheses and the symbols & (and) and | (or) can be used to combine the conditions.

df[(df.c_1==3) & (df.c_2<5)]

So far, we have only looked at dataframes containing pandas series (columns) of numerical values. However, we can see that the variable condition is a pandas series of Boolean values. We can add this to the dataframe df. Recall that when assigning we need to use square bracket indexing to create a new pandas series within an existing dataframe.

df['condition']=condition

We can then use the attribute dtypes to see the datatypes of each pandas series.

df.dtypes

Here we see that the pandas series 'condition' added to the dataframe is of type bool (Boolean).

Categorical

We can map the numeric value in 'c_1' to a grade, where a value less than 4 is low grade, a value greater than or equal to 4 and less than or equal to 6 is medium grade, and a value greater than 6 is high grade.

# condition for low
low=df.c_1<4
# condition for med
med=(df.c_1>=4) & (df.c_1<=6)
# condition for high
high=df.c_1>6

In order to use these as conditions within an if, elif and else statement nested within a for loop we need to convert the pandas series (columns) generated to lists.

# condition for low
low=df.c_1<4
low=low.to_list()
# condition for med
med=(df.c_1>=4) & (df.c_1<=6)
med=med.to_list()
# condition for high
high=df.c_1>6
high=high.to_list()

Now we can use the attribute df.shape to get the number of indexes (rows) nrow and number of pandas series (columns) ncol. We can then initialise a list of strings called grade by taking advantage of list replication.

# condition for low
low=df.c_1<4
low=low.to_list()
# condition for med
med=(df.c_1>=4) & (df.c_1<=6)
med=med.to_list()
# condition for high
high=df.c_1>=7
high=high.to_list()

(nrow,ncol)=df.shape
grade=['str']*nrow

Note the script conditions.py raises an error whenever the dataframe df is not defined. In this case the script conditions.py requires the separate script dataframes.py to be run in advance to generate the dataframe df. In our case df is in the console namespace and Spyder's general settings are set to run scripts in the console's namespace as opposed to an empty one, so this script will execute. For convenience we will restart the kernel, then run dataframes.py.

Now we can recreate the for loop with a nested if, elif and else statement. To recap we can look at the for loop and the if, elif and else statement individually:

for i in range(nrow):
    print(f'index {i}')
if low[0]==True:
    grade[0]='low'
elif med[0]==True:
    grade[0]='med'
else:
    grade[0]='high'

print(f'grade[0]={grade[0]}')

Finally, we can nest the if, elif, else statement within the for loop to create the list grade, which we can add as a new series to the dataframe df.

# condition for low
low=df.c_1<4
low=low.to_list()
# condition for med
med=(df.c_1>=4) & (df.c_1<=6)
med=med.to_list()
# condition for high
high=df.c_1>=7
high=high.to_list()

(nrow,ncol)=df.shape
grade=['str']*nrow
for i in range(nrow):
    if low[i]==True:
        grade[i]='low'
    elif med[i]==True:
        grade[i]='med'
    else:
        grade[i]='high'

df['grade']=grade

If we look at the attribute dtypes we can see that the pandas series (column) 'grade' has a dtype of object:
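df.dtypes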

This can be converted into a category using the method astype with the input argument being the string 'category'.

df.grade.astype('category')

Notice that this outputs a pandas series (column) and doesn't update the original pandas series (column). In addition, the categories are listed in alphabetical order and not as 'low', 'med', 'high' like we might expect.

We need to perform an inplace update of the pandas series (column):

df.grade=df.grade.astype('category')
df.dtypes

Now we will see the datatype of the pandas series 'grade' is updated to category.

Ordering Categories

Once the datatype is a category we can type in the dataframe df, select the categorical column grade as an attribute, then select the cat attribute, and then use dot . and tab to access a number of categorical attributes and methods.

We can use the method reorder_categories to specify the new order:

df.grade.cat.reorder_categories(['low','med','high'])

Notice once again that this outputs a pandas series (column) and doesn't update the original pandas series (column). We need to perform an in-place update:

df.grade=df.grade.cat.reorder_categories(['low','med','high'])

The cut function

The procedure used above was quite a cumbersome way to create ordered categories; they can instead be created with the function cut. This function has 2 positional input arguments: x, which is the column to be cut, and bins, which is the number of bins we want to cut the column up into.

Let's use it to create 3 bins:

pd.cut(df.c_1,bins=3)

The three slices (-0.009, 3.0], (3.0, 6.0] and (6.0,9.0] are created. Note the ( corresponds to greater than the lower bound and the ] corresponds to less than or equal to the top bound.

The function will return a pandas series (column) which in this case isn't assigned to a new name so just displays in the console. It isn't quite what we want, and instead of using a scalar to define the number of bins we can instead supply a vector of bin edges. For example, if we supply the vector [-1,4,6,10] then the boundaries will be (-1,4], (4,6] and (6,10] respectively, corresponding to the boundaries we created earlier when we used the if, elif and else statement nested within a for loop. We can also assign this to a new pandas series within the dataframe.

Notice that these all line up with our 'low', 'med' and 'high' categorical strings.

df['grade2']=pd.cut(df.c_1,bins=[-1,4,6,10])

Rename Categories

Note that the default category names are the lower and upper boundaries of each bin. We can rename them to correspond to 'low', 'med' and 'high' by creating a list of the new category names and using the category method rename_categories:

new_cat_names=['low','med','high']
df.grade2.cat.rename_categories(new_cat_names)

Once again, this method will return a pandas series (column) which in this case isn't assigned to a new name so just displays in the console.

We can reassign the original column to these new values for an inplace update:

df.grade2=df.grade2.cat.rename_categories(new_cat_names)

The categories renamed above were ordered categories created from numerical values, so it was convenient to rename them by use of a list. For unordered categories, or in scenarios where only one or two categories are to be renamed, it may be more convenient to use a dictionary where the key is the old name and the value is the new name. In this case we can demonstrate this by merely changing the case of each category.

cat_name_dict={'low':'Low','med':'Med','high':'High'}
df.grade2.cat.rename_categories(cat_name_dict)

Adding Categories

We may wish to scrap all the indexes (rows) where the series (column) 'c_1' is 0 and update the corresponding indexes (rows) in the series (column) 'grade2' to a new category 'scrap'.

df.grade2[df.c_1==0]='scrap'

Attempting to do so directly will result in a ValueError: Cannot setitem on a Categorical with a new category, set the categories first.

This is because we must create the new category before we can reassign values to it. Things are done in this manner usually to prevent one from creating an accidental category, e.g. 'Low' and 'low' through capitalization, or a punctuation error such as a typo or addition of a space. To create a new category we can use the cat method add_categories:

df.grade2.cat.add_categories('scrap')

Once again, this method will return a pandas series (column) which in this case isn't assigned to a new name so just displays in the console. However we also note the order of the categories here: the new category 'scrap' is seen as a higher category of the series 'grade2' than the grade 'high', which is clearly wrong.

We can reassign this to the original pandas series (column) 'grade2'. Then we can reorder the categories. Finally, we can select the indexes where the pandas series 'c_1' is 0 in the pandas series 'grade2' and set them to the category 'scrap', which works in this case because the category 'scrap' now exists.

df.grade2=df.grade2.cat.add_categories('scrap')
df.grade2=df.grade2.cat.reorder_categories(['scrap','low','med','high'])
df.grade2[df.c_1==0]='scrap'

Selecting by Categories

Once we have our categories, we can use conditional logic to index by them. For example:

df[df.grade2=='low']

Once again, the output dataframe isn't assigned to a variable name and just displays in the console. Because the categories are ordered we can also index using less than <, less than or equal to <=, greater than or equal to >= and greater than >.

df2=df[df.grade2<='low']

Grouping by Category

It is also possible to use the method groupby to group data by the categories in a selected column. A column can then be selected as an attribute of the groupby result, and an additional method such as count, mean or std can be used to look at the properties of each group. For example, we can count the number of values in the series 'c_1' where 'grade2' corresponds to each category.

df.groupby('grade2').c_1.count()

We might also want to see if there is a relation between the categories in 'grade2' and the mean value of each category in the series 'c_2' and 'c_4':

df.groupby('grade2').c_2.mean()
df.groupby('grade2').c_4.mean()

Selecting an Index

We have looked at selecting pandas series (columns). It is also possible to select a numerically labelled index (row) by use of the integer location indexer iloc, which is followed by the desired index (row) enclosed in square brackets.

df.iloc[5]

In this case we have not assigned the selection to an output variable so only see the results in the console. If instead we assign it to a variable name, the Type will be a pandas series.

row5=df.iloc[5]

Multiple indexes can be selected by using a slice. For example:

selection=df.iloc[4:9]

Which will give the output dataframe selection. Recall that zero-based indexing is applied, so when slicing we go up to the end value but don't include it.

Selecting a Cell

Although it is technically possible to select a cell by using either:

df.c_4[2]
df.iloc[2].c_4

It is not recommended to use the expressions above for reassignment of the value of a cell. For instance, if we type in the following expression:

df.iloc[2].c_4=16

Although the desired change appears to work, we get a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.

Instead it is recommended to use the indexer at, where both the index (row) and then the series (column) are specified enclosed in square brackets. For example we can reassign the value at the 5th index (row) and series 'c_1' (column):

df.at[5,'c_1']=8

Renaming Indexes

Earlier we saw that we can rename the pandas series (columns), and if we use names that follow the rules of variable names, we can access them as attributes. It is also possible to rename the indexes, although it is more common to leave them unnamed and access them using their numeric values like we have already seen. In this case we will use a list comprehension to create index names.

# Prerequisites
import numpy as np
import pandas as pd
# Create Random Data
np.random.seed(0)
data=np.random.randint(low=0,high=11,size=(12,5))
# Create DataFrame
columns=['c0','c1','c2','c3','c4']
index=np.arange(start=0,stop=12,step=1)
index=['r' + i for i in index.astype(str)]
df=pd.DataFrame(data=data,columns=columns,index=index)

Once index (row) names are created, provided that they are named following the rules behind variable names, they too will show up as attributes, or more specifically as attributes of each column in the dataframe. If we type in the dataframe followed by a dot . then a column name as an attribute, and then type another dot . followed by tab, we will see these indexes as attributes.

For example:

df.c0.r0

This also allows us to use both the indexers integer location iloc and location loc to select an index of the dataframe using its numerical value and its index (row) name respectively.

df.iloc[4]
df.loc['r4']

Note for dataframes where the indexes are not named, the name of each index is its numeric value, so the indexers loc and iloc will appear to behave the same. However if the indexes are named, loc will not accept the numeric value as a name as it has been changed.
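As a short sketch of this, with the named index created above df.iloc[4] still works because iloc is always positional, whereas:

df.loc[4]

raises a KeyError because 4 is no longer an index label.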

The indexer at will also use the name of the index. For example:

df.at['r2','c1']

Comma Separated Values (CSV) and Tab Delimited (TXT) Files

So far, every single array we have examined has been manually input. In this section we will instead look at how we can open data from a file, manipulate the data and then save it to a new file or overwrite the original.

We will start off with a spreadsheet created in Microsoft Excel. In this file we have two columns "A" and "B" which contain numeric values.

Let us save this to a comma separated values (.csv) file.

Now let's use an additional program, Notepad++, to open this saved csv file. In this file we will select view→show symbol→show all characters to show what are usually hidden formatting/punctuation characters, in order to help us better understand how data in the file is stored. Here we can see that a comma , acts as a delimiter (separating each value in Column A from the corresponding value in Column B). At the end of each line is CR and LF. These stand for Carriage Return and Line Feed respectively. We need both as CR returns us to the left-hand side of the file and LF brings us onto a new line. The name of these punctuation marks originates from typewriters. To use a typewriter after each line, the carriage had to be returned to the left-hand side of the piece of paper (carriage return) and the piece of paper then had to be slid up by the space of a line (line feed).
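We can also inspect these raw bytes from Python itself; the file contents shown in the comment are hypothetical, but the \r\n pairs are the CR LF characters Notepad++ displays:

with open('Book1.csv','rb') as f:
    print(f.read())
# e.g. b'A,B\r\n1,6\r\n2,7\r\n' (hypothetical contents)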

We can alternatively save the file as a Text (Tab delimited) .txt file.

We can now open this up in Notepad++ and the only difference we will see between the two files is the delimiter. In the .csv file the comma , is used as the delimiter and in the text file, the tab is used as the delimiter.
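Reading the raw bytes of the tab delimited file in the same way would show \t in place of the commas (assuming the file was saved as Book2.txt, the name used later in this guide; the exact contents are again hypothetical):

with open('Book2.txt','rb') as f:
    print(f.read())
# e.g. b'A\tB\r\n1\t6\r\n2\t7\r\n' (hypothetical contents)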

Again, punctuation marks like the tab are usually hidden but we have enabled them in this case in order to better understand what is going on. We can also save the file as Microsoft's own .xlsx format. Note in this format that our data is contained in an Excel workbook that has Sheet1.

When we try and open this file up in Notepad++ we won't be able to intuitively understand what is going on and it will look like gibberish. This is because this file format has a larger number of features, and each character in the workbook has its own size, font, font color etc. In this file format many more formatting characters are used, and like CR and LF these are normally hidden from the user.

The pandas library has a function called pd.read_csv() which can be used to read in csv and txt files. And although we cannot understand the xlsx file by use of Notepad++, the pandas library also contains a similar function pd.read_excel(), with more going on under the hood, which can be used with the xlsx file format.

Reading and Writing to CSV, TXT or XLSX File

Let's first use pd.read_csv() to create a dataframe from a csv file when the csv file is in the same folder as the python script. We can verify that they are in the same folder by use of the files tab. Because they are in the same folder, the full file path of the file doesn't need to be specified and we can instead use just the file name and extension.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
csv_file=pd.read_csv('Book1.csv')

Launching the script will give us the new dataframe csv_file.

If we compare the csv_file dataframe to the original csv file opened in Notepad++ we'll see that Notepad++ numbers lines starting at 1 while python uses zero-based indexing. That aside, the top row is missing in the csv_file because its values have been taken as the column names.

If we modify the csv file to have column names and reimport, we can see that these are instead used.

The function read_csv has a positional argument, the file to be read in, and several keyword input arguments. Recall that we can have a look at the input arguments for the function by typing in the function name with an open parenthesis and hovering over it. Looking at the first 5 keyword input arguments: we can specify a separator, however the function uses a comma , by default; the delimiter is just an alias for the separator. The next keyword argument header='infer' means the first non-blank line is taken as the column names by default. It can be overridden by changing its value to None and assigning a list to the next keyword argument names, which has a default value of None. An index column can also be specified by using index_col, which by default is set to None.

Removing the headings from the csv file, we can import the file as a dataframe by setting header=None and assigning a list of the desired names to names.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
csv_file=pd.read_csv('Book1.csv',
                     header=None,
                     names=['c0','c1'],
                     index_col=None)

We may use the keyword input argument skiprows for a file with non-data comment lines, for example in the file below three lines are non-data comment lines and there is no header.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
csv_file=pd.read_csv('Book1.csv',
                     skiprows=3,
                     header=None,
                     names=['c0','c1'],
                     index_col=None)

It is also possible to read in only a selection of rows, which can sometimes save time when previewing a file with a large number of rows, by using the keyword input argument nrows. Columns may also be selected by using the keyword input argument usecols and specifying the columns to use as a list of the original column positions or names. We can use the following if we only want to import the data from the first 5 data rows in column 1 of the csv file.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
csv_file=pd.read_csv('Book1.csv',
                     skiprows=3,
                     header=None,
                     names=['c1'],
                     index_col=None,
                     nrows=5,
                     usecols=[1])

Supposing we have the following csv file. Here the 0th column of the raw data contains the index names. We do not have the column names specified in the file and instead we want to create them when reading the file and assigning it to a dataframe. We can therefore use:

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
csv_file=pd.read_csv('Book1.csv',
                     header=None,
                     names=['C1','C2'],
                     index_col=0)

If we now return to Book2.txt and try to read it specifying column names and no index names.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
txt_file=pd.read_csv('Book2.txt',
                     header=None,
                     names=['C1','C2'],
                     index_col=None)

We get the situation above. This is because the separator (delimiter) is taken to be a comma , when it is actually a tab, and because we have specified 2 columns via the use of 2 column names, the second column is full of nan (not a number) values as no numeric values were found. This can of course be amended by specifying the separator or delimiter as a tab; to do this we use \t. We can use either sep or its alias delimiter to do this. So far, we have also assumed the file to be read is in the same folder. If it is in a different folder, or if we want to manually specify the full path, we can type it in. Recall that we can use a raw string if we copy and paste the filepath from Windows explorer:

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
txt_file=pd.read_csv(r'C:\Users\Phili\Documents\Python Scripts\Book2.txt',
                     sep='\t',
                     header=None,
                     names=['C1','C2'],
                     index_col=None)

This reads the file in correctly:

Now let's use pd.read_excel() to read an Excel xlsx file and save it as a dataframe called excel_file. When reading in an Excel File we can specify both the name of the xlsx file itself and the sheet within the xlsx file. In this case 'Book3.xlsx' and 'Sheet1' respectively.

By default, the keyword argument sheet_name is assigned a numerical value of 0, meaning it will read in the 0th sheet created in the excel workbook. This can be changed to another sheet using a numerical value corresponding to the order of the sheets in the Excel workbook, or alternatively the sheet name can be input as a string. The information about the keyword arguments for read_excel() isn't as in depth as read_csv(), however it shares most of the keyword arguments and has the additional keyword argument sheet_name.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
excel_file=pd.read_excel('Book3.xlsx',
                         sheet_name=0,
                         header=None,
                         names=['C0','C1'],
                         index_col=None)

We have seen before how to index individual columns and individual indexes. In addition we have also seen how to use the .index and .columns attributes to access the index and column names.

excel_file.index
excel_file.columns

We can rename the columns and indexes using the shape of the imported dataframe and associated list comprehensions.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
excel_file=pd.read_excel('Book3.xlsx',
                         sheet_name=0,
                         header=None,
                         names=['C0','C1'],
                         index_col=None)
(nrows,ncols)=np.shape(excel_file)
ridx=np.arange(nrows)
ridx=['r' + i for i in ridx.astype(str)]
excel_file.index=ridx
cidx=np.arange(ncols)
cidx=['c' + i for i in cidx.astype(str)]
excel_file.columns=cidx

Now that we have a modified dataframe, we might want to save it back to a file. To do this we can type in the dataframe's name followed by its method .to_excel(). We need to specify the file name as an input argument. We can add a sheet_name as a keyword argument; this has a default value of 'Sheet1'. Note that if the excel file is already present, it will be overwritten without warning.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
excel_file=pd.read_excel('Book3.xlsx',
                         sheet_name=0,
                         header=None,
                         names=['C0','C1'],
                         index_col=None)
(nrows,ncols)=np.shape(excel_file)
ridx=np.arange(nrows)
ridx=['r' + i for i in ridx.astype(str)]
excel_file.index=ridx
cidx=np.arange(ncols)
cidx=['c' + i for i in cidx.astype(str)]
excel_file.columns=cidx
excel_file.to_excel('Book3_Updated.xlsx',
                    sheet_name='Updated_Sheet')

Once again if a full file path is not specified, the generated excel file will be in the same folder as the script. A full path or relative path can be input to save the file elsewhere.

There is a similar method to_csv() which, like read_csv(), doesn't have a keyword argument sheet_name because csv and txt files don't have sheets. to_csv can be used to write both csv and txt files.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
excel_file=pd.read_excel('Book3.xlsx',
                         sheet_name=0,
                         header=None,
                         names=['C0','C1'],
                         index_col=None)
(nrows,ncols)=np.shape(excel_file)
ridx=np.arange(nrows)
ridx=['r' + i for i in ridx.astype(str)]
excel_file.index=ridx
cidx=np.arange(ncols)
cidx=['c' + i for i in cidx.astype(str)]
excel_file.columns=cidx
excel_file.to_excel('Book3_Updated.xlsx',
                    sheet_name='Updated_Sheet')
excel_file.to_csv('Book3_Updated.csv')
excel_file.to_csv('Book3_Updated.txt',sep='\t')

Missing Data

Let's now look at a spreadsheet which has some cells that contain an error, for example a division by zero error, or have missing data. If we import this as a dataframe these will be replaced by nan, denoting not a number.

# %% Prerequisites
import numpy as np
import pandas as pd
# %% Read A File
excel_file=pd.read_excel('Book4.xlsx',
                         sheet_name='Sheet1',
                         header=None,
                         names=['c0','c1'],
                         index_col=None)

We can use the pandas functions isna() or notna() to check if a value is or isn't a nan.

pd.isna(excel_file)
pd.notna(excel_file)

This returns a Boolean dataframe corresponding to the input.

We can also use the method sum to count the number of nan values and non-nan values.

pd.isna(excel_file).sum()
pd.notna(excel_file).sum()

The method dropna can also be used on a dataframe. We can type it with open parenthesis to see the keyword input arguments:

excel_file.dropna(

We can see that by default axis=0, how='any' and inplace=False. This means it will act on indexes (rows), removing any index (row) that has any nan value, and the result will be returned as a new dataframe.

excel_file.dropna(how='any')

On the other hand, it can be changed to how='all' which will only remove indexes (rows) where all values are missing.

excel_file.dropna(how='all')

It is also possible to specify a subset of pandas series (columns) in which to look for 'any' or 'all' nan values. For instance, if we were only interested in removing indexes that had a nan value in the pandas series 'c1' we would assign the keyword subset to a single element list.

excel_file.dropna(how='any',subset=['c1'])

The method fillna can be used to fill any missing values. We can once again type it with open parenthesis to see the keyword input arguments.

We can fill in the data with a number, say for example the number 0.

excel_file.fillna(0)

Alternatively, we can use a dictionary to fill each column with a specific value, for instance replacing every missing value in pandas series 'c0' with 0 and pandas series 'c1' with 99.

fill_dict={'c0':0,'c1':99}
excel_file.fillna(value=fill_dict)

This may be particularly useful if each pandas series has a different dtype. For example, if you have a dataframe with a numeric series, a Boolean series and a series that consists of strings, it will make more sense to choose a default value for the nan values to suit each series.
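As a minimal sketch of this idea (this small dataframe and its fill values are hypothetical and separate from excel_file):

mixed=pd.DataFrame({'num':[1.5,None],'flag':[True,None],'txt':['a',None]})
mixed.fillna(value={'num':0.0,'flag':False,'txt':''})

Each series receives a fill value appropriate to its dtype.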

The keyword input argument method can be set to forward fill 'ffill' or backfill 'bfill' where each missing value will take the previous non-missing value or next non-missing value respectively.

excel_file.fillna(method='ffill')

Forward fill requires a previous number. If we replace the value on the zeroth index of series 'c1' with None we will see it is unable to replace the nan values on the first two indexes of 'c1'.

excel_file.at[0,'c1']=None
excel_file.fillna(method='ffill')

The pandas dataframe also has the method interpolate, which will only work on numeric values and for several of its methods uses scipy's interpolation routines under the hood. We can type it in with open parenthesis to view the keyword input arguments:

We can try the default linear interpolation, where all keyword input arguments are left at their default values:

excel_file.interpolate()

And in the case of our very simple dataset the values are interpolated correctly.
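As a final minimal sketch of the idea on a standalone series:

s=pd.Series([1.0,None,None,4.0])
s.interpolate()

The two missing values are filled linearly, giving 1.0, 2.0, 3.0 and 4.0.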
