The Python Data Analysis (pandas) Library
This guide will look at the use of the pandas library, which is an abbreviation of Python Data Analysis. The library is built around the dataframe data structure, which can be conceptualised as an Excel spreadsheet.
In this guide I will use the Anaconda 2021-11 Python Distribution with the Spyder 5 IDE to look at the core concepts behind this Python library. If you haven't already made yourself comfortable with the Spyder 5 IDE, the Python programming language and the numpy data science library (which pandas is built upon), please check out my other guides before starting, to get the most out of this guide:
Table of contents
A dataframe object
Conceptualising a dataframe
pandas is an abbreviation of Python Data Analysis. It is a library based around the dataframe data structure, which conceptually can be compared to an Excel sheet:

The dataframe has an index which is normally numeric (using zero-order indexing) highlighted in yellow. Every numeric value along the index is often referred to as an observation, for example observation 0, observation 1 and so on…:

The data in the dataframe is grouped into series. A series has three parts: the series name, a string highlighted in orange (in this case "boolean"); the series values, highlighted in blue, which are essentially a column of data of a constant datatype; and the index associated with each value, highlighted once again in yellow. Every value highlighted in blue in this case is a bool. These blue values essentially inherit from an ndarray of type bool, so the numpy functions designed for a bool ndarray can be used on the series, and many of the bool series methods will also be similar to those of a bool ndarray as they use inheritance from numpy:

If we move onto the next series, we can see that the series name, this time "integer", is once again highlighted in orange. The series values highlighted in blue are now a constant integer type and this time inherit from an ndarray of type int. numpy functions designed for an int ndarray can be used on the series and many of the int series methods will use inheritance from the int ndarray. Finally the index is shared with the previous series as both series are displayed side by side in the dataframe:

The next series "floating" is a series of floating point number values:

The next series "string" is a series of strings:

The next series "datetime" is a series of datetime values:

Finally the next series "category" is a series of categories:

Collectively highlighted in orange we have the column names (these are also of the data type index – this index is not numeric but is instead a collection of strings):

We can see that observation 0 has a "boolean" value of True, an "integer" value of 1, a "floating" value of 3.14, a "string" value of "hello", a "datetime" value of datetime(09-01-2022 07:40) and a category of "A".
And that observation 1 has a "boolean" value of False, an "integer" value of 3, a "floating" value of 6.28, a "string" value of "world", a "datetime" value of datetime(09-01-2022 07:41) and a category of "B".
Python dictionary
From the above we see that each series has a name and a series of values. We can use a Python dictionary to group a collection of data to a name:
data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [1, 3, 5, 7, 6, 9],
             "floating": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}

Viewing the dictionary in the variable explorer we see:

Notice that each value is an equally sized list of size 6. If we expand one of these lists, for example the "boolean" list, we see that every value has a numeric index that has been automatically generated and there is a constant datatype bool for each item in this list:
The list "floating" has a matching numeric index and this time the constant datatype for each element in the list is float:
Importing the data science libraries
We will use two data science libraries, numpy and pandas. These are located in:
%UserProfile%\Anaconda3\Lib\site-packages
There is a numpy folder with a __init__.py file and a pandas folder with a __init__.py file. These files are referenced when we type in:
import numpy as np
import pandas as pd
As both are going to be commonly referenced, we import them using the 2-3 letter aliases np and pd respectively. As these are third-party libraries not included in the Python standard library, we need to run these two lines of code so the libraries are imported into the kernel. Doing so will allow us to use auto-completion:

Object Orientated Programming Recap
Now when we type in:
pd.
We can view a number of objects that we can call from the pandas library:

We see a list with classes at the top, followed by mainly functions including the odd module and variable.
Notice that each class name is CamelCaseCapitalized, such as Index, Series and DataFrame. pandas is a third-party Python library and therefore follows the PEP 8 Style Guide for Python Code, which states:
Class names should normally use the CapWords convention.
Note that the inbuilt Python classes int, float, bool, str, list, tuple and dict for example predate this convention and are not CamelCaseCapitalized.
In the case of numpy, the ndarray class is also not CamelCaseCapitalized; however the ndarray class is very rarely used to initialize a new instance directly and instead the array function is used. The ndarray class therefore uses the naming convention of a function:
The naming convention for functions may be used instead in cases where the interface is documented and used primarily as a callable.
Before working with the DataFrame class it'll help to review some of the features behind a third-party class. A class can be thought of as a blueprint that contains data in the form of attributes (objects) and functionality in the form of methods (functions).
A class can be conceptualised as a blueprint, for example the blueprint of a house. The blueprint itself is abstract but can be used to construct a house which is a real object that we, as a home owner or renter can interact with. A physical house will have data attributes such as the height of the house and other dimensions as well as the number of rooms in the house. It will also have functionality in terms of the operation of the doors of the house and the central heating settings which the user can change.
In Python, a class is a set of instructions based around an object. The object normally has data in the form of attributes and functionality in the form of methods. An attribute is an object that is accessed with respect to another object. A method is a function defined within a class.
The class normally has a constructor method which is used to initialise a new instance of the class. Recall that the values one = 1, two = 2 and three = 3 are all instances of the int class.
one = 1
two = 2
three = 3
These instances all share the same list of methods (functions) and attributes (properties/objects) but are initialised to have a different numeric value.
In each case the attribute real is initialised to have a different value for each instance. For example if we use:
one.real
two.real
three.real
Notice that we access an attribute by typing in the name of the instance the attribute belongs to, followed by a . and then the attribute name.

We observe a different value as this attribute was initialised to a different value each time when the class was instantiated.
In the case of a method, as it is a function belonging to the object and defined in the object's class, we need to provide parentheses, enclosing any required input arguments, in order to call the function.
For example the complex conjugate can be called, which requires no input argument:
one.conjugate()

Because an int only has a real component, the imaginary component is taken as 0 and as the complex conjugate only switches signs of the imaginary component, the result is therefore unchanged.
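For comparison, here is a quick sketch with a complex instance, where the imaginary component is non-zero and the sign flip from conjugate is visible:

z = 3 + 4j
z.real           # 3.0
z.imag           # 4.0
z.conjugate()    # (3-4j)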
Let's now create a CustomClass saved in a file called customclass.py. This simple class has two methods and three attributes.
class CustomClass(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.nprint = 0

    def print_attributes(self):
        print(f"x is {self.x}, y is {self.y}")
        print(f"print_attributes run #{self.nprint}")
        self.nprint += 1
In another Python script file called object_orientated_programming.py, let's import the CustomClass and then instantiate an instance custom_one:
from customclass import CustomClass
custom_one = CustomClass(1, 2)

In line 1 of object_orientated_programming.py we look at the file customclass (indicated in orange) and import the blueprint (CustomClass) indicated in pink.
In line 2 of object_orientated_programming.py we create an instance custom_one of the class CustomClass. If we run this with the debugger and step into it we can see that the __init__ datamodel method is called.
The first positional input argument of the __init__ datamodel method self is inferred by the object name, custom_one (indicated in red).
The second and third positional input arguments must be supplied when instantiating a new instance as indicated in green and blue respectively. These are then assigned to become attributes.
The attribute nprint is assigned to the value 0 as indicated in cyan.
The instance will also have access to the method print_attributes as defined in the class.
These can all be accessed by typing in the object's name followed by a .
custom_one.
The attributes x and y can be accessed from the instance custom_one and assigned to the variables x and y which display in the variable explorer:
x = custom_one.x
y = custom_one.y
The method print_attributes can be called using open parenthesis:
custom_one.print_attributes()
custom_one.print_attributes()
custom_one.print_attributes()
Notice how the attribute nprint increments by 1 every time this method is run:

In the example above we have used only simple integer values for each attribute; however an attribute can be any other object, including an instance of another third-party class.
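For example (a quick sketch), we could instantiate the CustomClass from above with a numpy array as an attribute:

import numpy as np
from customclass import CustomClass

custom_two = CustomClass(np.array([1, 2, 3]), 2)
custom_two.x          # array([1, 2, 3])
type(custom_two.x)    # numpy.ndarray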
When using pandas, we are going to be working with the DataFrame class. Let's have a look at the __init__.py file within the pandas folder:
%UserProfile%\Anaconda3\Lib\site-packages\pandas
We can see the line:
from pandas.core.api import DataFrame

Now if we have a look at the api.py file within the core folder we see:
from pandas.core.frame import DataFrame

And if we have a look at the frame.py file we see the blueprint for the DataFrame class, including its __init__ method:

As end users of the pandas library we only want to use the classes above to create data structures for our data; we do not want to view or modify the class itself. Going back to our previous example:

We are only focused on creating an instance custom_one of the CustomClass and then interacting with the instance we created:

Instantiating a dataframe
To create a dataframe object (lower case for an instance) we need to call the constructor method of the DataFrame class (upper case for a class). Recall that the class can be thought of as an abstract blueprint which defines how an object is constructed and what attributes (properties) or methods (functionality) the object has.
The first input argument for the DataFrame class is data:

We can assign data to a dictionary. The dictionary must have keys that are in the form of a str and each value must be a list of the same length. When used to construct an instance df of the DataFrame class, the keys in the dictionary will become the column names and the list of values in the dictionary will become the data for each series.
We can specify data as a keyword argument or positional input argument:
df = pd.DataFrame(data=data_dict)
df = pd.DataFrame(data_dict)
It is also possible to instantiate a df using data as a matrix, usually in the form of a numpy array, alongside the additional keyword input argument columns, which is usually in the form of a list of strings. Each column in the matrix will become its own series within the dataframe. The number of strings in the list of strings used for columns must match the number of columns in the data matrix. Once again data can be specified as a keyword or positional input argument.
df = pd.DataFrame(data=data_matrix, columns=columns_list)
df = pd.DataFrame(data_matrix, columns=columns_list)
Note however that columns is the third positional input argument and there is an additional input argument index that can be used to insert a custom index for example a non-numeric index:
df = pd.DataFrame(data_matrix, None, columns_list)
If a custom index is used, the index must match the number of rows in the data matrix or the length of each list in the dictionary of string keys and lists values:
df = pd.DataFrame(data=data_matrix, index=index_list, columns=columns_list)
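For instance, a minimal sketch (the names data_matrix, index_list and columns_list are placeholders for illustration):

import numpy as np
import pandas as pd

data_matrix = np.array([[1, 3.14],
                        [3, 6.28]])
index_list = ["r0", "r1"]
columns_list = ["integer", "floating"]
df = pd.DataFrame(data=data_matrix, index=index_list, columns=columns_list)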
There is an additional keyword input argument dtype which can be used to set the datatype for the data matrix. Note this input argument is singular and will only apply a single datatype across the entire dataframe and therefore has limited use. There is no means to assign the datatype of each series individually in the form of a list when instantiating a new dataframe.
If a dataframe is instantiated from a dictionary, the datatype for each series will be automatically determined by the series data.
Indexing
Let's use our dictionary from earlier:
import pandas as pd

data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [1, 3, 5, 7, 6, 9],
             "floating": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}
data = pd.DataFrame(data=data_dict)

This gives a dataframe of the form:
Let's compare this to the dictionary directly:

Recall that the dictionary doesn't use numeric indexes but instead uses keys which are usually in the format of a str. Indexing is carried out with square brackets:
data_dict["boolean"]
We can index into the dataframe to get a series in much the same manner:
data["boolean"]
Note the list is shown as a row whereas the series is always displayed as a column with the corresponding numeric index for each value. The series name and datatype are also given:
Multiple columns can be selected by using a list of the column names outputting a new dataframe:
data[["boolean", "integer"]]

This can also be used to reorder the columns in the dataframe:
data2 = data[["integer", "boolean", "floating", "string"]]
Recall that a new key, value pair can be added to a dictionary by indexing into the dictionary with a new key and assigning it to a new value:
data_dict["datetime"] = times
This can be likewise done with a dataframe using the same syntax:
data["datetime"] = times
Where times is a new list or array with matching dimensions to the index of the dataframe.
import numpy as np
import pandas as pd

data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [1, 3, 5, 7, 6, 9],
             "floating": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}
data = pd.DataFrame(data=data_dict)
times = [np.datetime64("2022-01-09T07:40"),
         np.datetime64("2022-01-09T07:41"),
         np.datetime64("2022-01-09T07:42"),
         np.datetime64("2022-01-09T07:43"),
         np.datetime64("2022-01-09T07:45"),
         np.datetime64("2022-01-09T07:46")]
data_dict["datetime"] = times
data["datetime"] = times
The np.datetime64 values don't display properly in the Spyder variable explorer unless they are displayed within a dataframe:

This syntax could also be used to replace values within an existing column.
data["integer"] = new_int

We can use type to clarify the data type of an object. Let's have a look at type of the objects below:
type(data)
type(data["boolean"])

We see that data is an instance of the DataFrame class and data["boolean"] is an instance of the Series class. Note that in this particular case the series object is a property that can be accessed with respect to the dataframe object. This means the series object can be accessed as an attribute of the dataframe object and we can use the dot notation used throughout Python:
data.boolean

There are some slight nuances with the dot notation when used with a dataframe however:
The series name cannot have a space, because a space is forbidden in an object name and an attribute is in essence an object. The series name also cannot be numeric, as a number can only be incorporated as part of an object name and an object name can't begin with a number.
If there is a space, the attribute name will not be accessible and the only way to select the series will be to index into it using square brackets containing the series name as a string.
The dot notation is only available for an existing series attribute and cannot be used to create a new series attribute. If you try to do so, you will get the following user warning:
UserWarning: Pandas doesn't allow columns to be created via a new attribute name

The attribute notation is only available once a series has been created. It can however be used to access values from an existing attribute and to replace the values within an existing attribute.
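As a quick sketch of both nuances (the names "two words" and newcol are purely illustrative and you may not want to keep these columns):

data["two words"] = np.arange(6)
data["two words"]            # works
# data.two words             # not possible: a space cannot appear in an attribute name

data.newcol = np.arange(6)   # raises the UserWarning shown above
"newcol" in data.columns     # False, no column was created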
Now that we have selected a series from within a dataframe we can index into it using the index:
data["boolean"][2]
The rows can only be accessed using square brackets and the numeric index by default (similar to indexing a value within a list).

By default the row names are only numeric and the series therefore does not have an attribute corresponding to each data value (row) in the series.
We can update our dataframe to have a new index of observation names using the following:
index_names = ["r0", "r1", "r2", "r3", "r4", "r5"]
Which can also be created using list comprehension:
index_names = ["r"+str(i) for i in range(6)]
Then we can recreate our dataframe using the new index of observation names:
import numpy as np
import pandas as pd

data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [1, 3, 5, 7, 6, 9],
             "floating": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}
times = [np.datetime64("2022-01-09T07:40"),
         np.datetime64("2022-01-09T07:41"),
         np.datetime64("2022-01-09T07:42"),
         np.datetime64("2022-01-09T07:43"),
         np.datetime64("2022-01-09T07:45"),
         np.datetime64("2022-01-09T07:46")]
data_dict["datetime"] = times
index_names = ["r"+str(i) for i in range(6)]
data = pd.DataFrame(data=data_dict, index=index_names)

Notice that the index observation names are now in the form of strings.

Now we can index a value from the series using:
data["boolean"]["r2"] data.boolean.r2

Note that r2 is an attribute of the series which is itself an attribute of the dataframe. r2 cannot be accessed as an attribute directly from the dataframe:

Instead we have the following attributes location loc and integer location iloc which can be used to select the row or selection of rows by using square brackets containing the row name or row number respectively:
data.iloc[2]
data.loc["r2"]
Note the datatype returned is a series of type object and, as a series is usually a column, it is displayed as a column:

As this is essentially a series it can be indexed further using square brackets:
data.iloc[2][2:4]

Indexing using loc or iloc is usually carried out by selecting the observation and series in a single set of square brackets:
data.iloc[2, 2:4]
data.iloc[[2, 3], [0, 1]]
data.loc[["r2", "r3"], ["boolean", "integer"]]

The attributes at and integer at iat can be used to select a single cell by using the row name and then the column name, or the row int and column int respectively:
data.at["r2", "boolean"] data.iat[2, 0]

Renaming column names and index names
The columns and index are attributes of the dataframe; both have the same datatype Index, which is essentially a list of strings:
type(data.columns)
data.columns
type(data.index)
data.index

The list of attributes and methods available mainly correspond to string operations.
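For example (a quick sketch), the str accessor exposes vectorised string methods on the Index:

data.columns.str.upper()          # Index of upper case column names
data.columns.str.startswith("i")  # boolean array, True only for "integer"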

We can rename these by accessing the attribute directly and assigning it to a list or range object of equal size:
data.columns = ["bool2", "int2", "float2", "str2", "time2"]
data.index = np.arange(0, 6)

Alternatively we can rename these using the dataframe's method rename and we can specify either the columns or index and assign it to a dictionary of key value strings where the key is the old name and the value is the new name. For example if we rename the column bool2 to the column bool3:
data.rename(columns={"bool2": "bool3"})
Notice that the console displays an output and the column is updated in the output. If output is shown in the console it means the dataframe method or pandas function has a return statement and that changes aren't made inplace.

The dataframe data for example remains unchanged:

To make changes we need to reassign the output to the dataframe data:
data = data.rename(columns={"bool2": "bool3"})

This method has a keyword input argument inplace which shows up in many of the dataframe methods; however it is scheduled for deprecation (removal from pretty much everywhere in future pandas versions).
In the pandas functions, dataframe methods or series methods where inplace is currently present, it has a default value of False and the method will return a new dataframe as we have just seen. It can otherwise be ignored.
If inplace is overridden to True, then the method will have no return statement and instead the dataframe will be updated to reflect the changes. This causes a lot of confusion for beginners and is the reason for the removal of inplace.
To demonstrate this I will rename the column bool3 back to bool2:
data.rename(columns={"bool3": "bool2"}, inplace=True)
And we can see that this works as expected:

Care should be taken when using the keyword input argument inplace as the method then has no return statement. If I attempt to rename the dataframe column bool2 back to bool3 and assign the output to the dataframe:
data = data.rename(columns={"bool2": "bool3"}, inplace=True)
Notice that we have performed two operations. We have updated data inplace and then assigned the output of the function (which has no return statement and therefore returns None) to data. This results in loss of our dataframe:

If we recreate the dataframe, we can rename an index using:
data.rename(index={0: "r0"}, inplace=True)
Dropping a column or observation
Let's recreate our dataframe using:
import numpy as np
import pandas as pd

data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [1, 3, 5, 7, 6, 9],
             "floating": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}
times = [np.datetime64("2022-01-09T07:40"),
         np.datetime64("2022-01-09T07:41"),
         np.datetime64("2022-01-09T07:42"),
         np.datetime64("2022-01-09T07:43"),
         np.datetime64("2022-01-09T07:45"),
         np.datetime64("2022-01-09T07:46")]
data_dict["datetime"] = times
data = pd.DataFrame(data=data_dict)
We can drop a series or an observation by using the dataframe method drop, which once again has the keyword input arguments columns and index. These act in the same manner as described before when using rename; however columns and index expect a list of names to drop as opposed to a dictionary of key:value pairs of old names:new names.
data = data.drop(columns=["datetime"])
data = data.drop(index=3)
Notice that the numeric index now has a missing value 3. We can use the method reset_index to reset it. We need to use the keyword input argument drop and assign it to True, to prevent the old index from being added as a new series:
data = data.reset_index(drop=True)
Inserting a column
We have seen before that it is relatively easy to add a new column to a dataframe by indexing into the dataframe with a set of square brackets enclosing the name of the new column and then assigning it to a list or series of the correct dimensions.
data["integer2"] = np.arange(0, 6, 1)
This will append the new column to the far right.
We have also seen how we can use square indexing to make a selection from the original dataframe or to reorder the original dataframe.
We can also use the dataframe method insert, to insert a column at a certain location. We need to specify the input arguments loc to assign the numeric integer value of the column, column to assign the column name in the form of a string and values in the form of a list, numpy array or series.
data.insert(loc=2, column="integer3", value=np.arange(0, 6, 1))
This method makes inplace changes to the dataframe and has no keyword argument inplace.
Unfortunately the insert method also lacks the keyword index and we can't use it to insert an observation directly into the dataframe…
We therefore must use the dataframe method append or the pandas function concat which are designed to work with two dataframes which have the same columns.
Appending and Concatenation
To append or concat we must have two dataframes with the same column names:
import numpy as np
import pandas as pd

data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [1, 3, 5, 7, 6, 9],
             "floating": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}
data_dict2 = {"boolean": [True, False],
              "integer": [11, 33],
              "floating": [14.22, 15.01],
              "string": ["and", "tomorrow"]}
data = pd.DataFrame(data=data_dict)
data2 = pd.DataFrame(data=data_dict2)
Appending is a dataframe method whereas concat is a pandas function. To append we need to call the append method from the dataframe we wish to have on top and then supply the second dataframe as the other input argument:
data.append(other=data2)
Notice that the index now goes 0,1,2,3,4,5,0,1, which gives observations with duplicate names in the index. If I use the input argument ignore_index and assign it to True, the index will display as 0,1,2,3,4,5,6,7. We can reassign the output to the original dataframe data to keep the changes.
data = data.append(other=data2, ignore_index=True)
We can use the pandas function concat to concatenate two dataframes along an axis.
The default value for axis is axis=0, which means we concatenate along the row axis by default. We can change this to axis=1 to operate along columns instead.
You may recall from working with numpy arrays we can get the shape of a numpy array or dataframe using the attribute shape with the -1st index of the shape tuple always corresponding to the number of columns and the -2nd index of the shape tuple always corresponding to rows. In the case of numpy it was more useful to think of columns as the -1st and rows as the -2nd axis as we could have n-dimensional arrays.
shape = data.shape
ncols = data.shape[-1]
nrows = data.shape[-2]

A dataframe on the other hand is always 2d and most of the dataframe methods do not accept the negative index. An axis=0 will always refer to working along the index axis and can also be assigned to the string axis="index". An axis=1 will always refer to working along the column axis and can also be assigned to the string axis="columns".
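As a quick sketch of the two spellings (restricting to the numeric columns so that a sum makes sense):

data[["integer", "floating"]].sum(axis="index")     # one total per column
data[["integer", "floating"]].sum(axis="columns")   # one total per observation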
In the example above we see (going from right to left i.e. using negative indexing) that the dataframe has 4 columns and 6 rows:
The concat function has the input arguments objs which is a tuple of the data to be concatenated, followed by the axis, join (which is a string "outer" or "inner") and ignore_index which we have seen earlier:
data3 = pd.concat(objs=(data, data2), axis="index", join="outer", ignore_index=True)
Let's create two dataframes and concatenate along axis=1 (axis="columns"). data has observations 0-5 along its index and data2 has observations 0 and 3 along its index. Let's concatenate these using outer and inner joining and assign the outputs to the dataframes data3 and data4 respectively. For clarity we will set ignore_index=False:
import numpy as np
import pandas as pd

data_dict = {"bool1": [True, False, True, False, True, False],
             "int1": [1, 3, 5, 7, 6, 9],
             "float1": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "str1": ["hello", "world", "it", "is", "raining", "today"]}
data_dict2 = {"bool2": [True, False],
              "int2": [11, 33]}
data = pd.DataFrame(data=data_dict, index=[0, 1, 2, 3, 4, 5])
data2 = pd.DataFrame(data=data_dict2, index=[0, 3])
data3 = pd.concat(objs=(data, data2), axis="columns", join="outer", ignore_index=False)
data4 = pd.concat(objs=(data, data2), axis="columns", join="inner", ignore_index=False)
Notice how the dataframe data has an index with 6 observations numerically labelled 0, 1, 2, 3, 4 and 5:
The second dataframe data2 has an index with observations numerically labelled 0 and 3. These are a subset of the observations available in dataframe data. The dataframe data2 has 2 different columns not present in the dataframe data:
If we use an outer join (default), we get nan (not a number) for each observation in data not present in data2 and vice-versa:
If instead we use an inner join, then only the observations present in both of the dataframes being concatenated are joined, in this case observations 0 and 3:
To add an additional observation to an index we must append it or use concatenation. Both of these only work with an existing dataframe as we have seen above. To create a new observation from a list of values, we can zip the column names (which recall form a pandas Index that can be thought of as a list of strings) as keys with the new row values and then cast this zipped object into a dictionary.
new_row_list = [False, 10, 100.2, "and"]
new_row_dict = dict(zip(data.columns, new_row_list))
We can then use this to create a DataFrame with a single observation. Note we get the error message:
ValueError: If using all scalar values, you must pass an index
unless we specify an index in the form of a list. As we want our index to be the next row, in this case observation 6 (which is the length of 0,1,2,3,4,5 using zero order indexing), we can assign this to a single element list with the value data.shape[0] (recalling that the 0th value of the shape tuple of a dataframe i.e. a 2d array is the number of rows):
new_row = pd.DataFrame(data=new_row_dict, index=[data.shape[0],])
Finally we can append this new observation to data using:
data = data.append(other=new_row)
Reading and Writing to Files
Comma Separated Values (CSV) and Tab Delimited (TXT) Files
Let's open up Microsoft Excel and create a basic spreadsheet. Notice that this is in the same format as our dataframe. Column B has a column title boolean followed by boolean values which are automatically detected and capitalised by Excel. Excel recognises strings as text and doesn't place quotations around them. It also recognises numerical data (ints and floats). Excel also formats datetime data automatically:
Let's first save this as Book1.csv, a comma separated values file:

We can inspect this file in Notepad++ and see the comma is used as a delimiter (which essentially gives the instruction to move onto the next column). To get a new observation we instead use CR and LF (Carriage Return and Line Feed respectively). The string "hello, hello" is enclosed in quotations to indicate that the , is part of the string and not a delimiter:

We can now save the file as a tab delimited text file. Viewing this in Notepad++ we can see that a tab is used as the delimiter as opposed to a comma and otherwise all content in the file is the same:

If we instead save the file as an .xlsx file we see that it is harder to understand in Notepad++. This is because the file has much more capability; it is a workbook with multiple sheets and each item in each cell can be formatted:
Reading a CSV File or a TXT File
pandas has a read_csv function which can be used to read csv files or tab delimited files and convert them to a dataframe. By default it is set up to read a csv file (as the function name indicates). The function has a variety of keyword input arguments which are set up by default to read data formatted like the example csv file above. Notice that the script file and the csv file are in the same folder, which means I can just specify the file name:
import numpy as np
import pandas as pd

data = pd.read_csv("Book1.csv")

If you want a file from a different folder you would need to use the full file path, in this case using a raw string:
full_file_path = r"C:\Users\Public\Documents\Book1.csv"
data = pd.read_csv(full_file_path)
Recall that a raw string treats every \ as a literal character. In a regular string a \ indicates we are inserting an escape character, so each \ in the path would otherwise need to be written as \\ where the second \ indicates that the escape character we want is \ itself.

The data is exactly in the format we desire for a dataframe however if we have a look at the dtypes
data.dtypes
We can see that only the numeric datatypes are recognised and the string, datetime and category series are all recognised as object (which means pandas recognises these as strings). We will have a look at these other datatypes in a bit more detail later:

If we type in:
pd.read_csv()
We see the docstring.

Keyword Argument | Possible Values | Description |
sep, delimiter (alias) | ",", "\t" | comma delimiter, tab delimiter |
decimal | ".", "," | The UK and USA use . for a decimal point and , as the csv separator. Germany and France use , for a decimal point and ; as the separator. |
header | "infer", None | infer the header (from the names or numeric values of the top row), or no header |
names | ["new1", "new2", …] | List of new column names. It is recommended to replace any column names containing a space with names without a space, so they display as dataframe attributes. |
index_col | None, 0, 1, … | Zero-order value of the column which contains the index i.e. the list of observation names. |
usecols | None, [0], [0, 2], [0:2] | Instructions on which columns to use. If None is set (default) all columns are imported. |
skip_blank_lines | True, False | Skips blank lines, or does not skip blank lines. |
nrows | None, 0, 1, 2, 3, … | Number of observations to import. If None is set (default) all observations are imported. |
skiprows | None, 0, 1, 2, … | Number of rows to skip at the top of the file, corresponding to a header. |
skipfooter | 0, 1, 2, … | Number of rows to skip at the bottom of the file, corresponding to a footer. |
parse_dates | False, ["datetime"] | Import all dates as strings, or provide a list of column names to attempt to parse as dates. |
infer_datetime_format | False, True | Do not attempt to automatically infer the datetime format, or automatically attempt to determine whether the datetime format is UK or US. |
dayfirst | False, True | US format, or UK format. |
To read the tab delimited text file and to parse the dates in the Series "datetime" in the UK format we can use:
import numpy as np
import pandas as pd

data = pd.read_csv("book2.txt", sep="\t",
                   parse_dates=["datetime"],
                   infer_datetime_format=False,
                   dayfirst=True)

And if our data had no headings and we only wanted a selection of the columns:

import numpy as np
import pandas as pd

data = pd.read_csv("book1.csv", header=None,
                   names=["bool2", "int2", "float2", "datetime"],
                   usecols=[0, 1, 2, 4],
                   parse_dates=["datetime"],
                   infer_datetime_format=False,
                   dayfirst=True)

Reading an Excel XLSX File
There is a very similar function to read in an Excel file. This function will ignore any formatting in the Excel Workbook and just import the data. The keyword input arguments are mainly a subset of those available for read_csv because, for example, the delimiter (alias separator), decimal character and datetime format are provided by the Excel file, so we don't need to supply these or attempt to infer them.
Excel Workbooks have multiple sheets with the default sheet name being called "Sheet1" at position 0 (zero-order indexing). The keyword argument sheet_name=0 will refer to this sheet by default however can be updated to refer to the numeric value of another sheet or the sheet name in the form of a string.
In this example the sheet name at index 0 is "Book2"; this name was taken because the data was previously saved as Book2.txt.
To read this in as a dataframe we can use:
import numpy as np
import pandas as pd

data = pd.read_excel("Book3.xlsx", sheet_name="Book2",
                     parse_dates=["datetime"])

Writing a dataframe to File
Any dataframe has the methods to_excel and to_csv. We can once again get details about the input arguments by typing in the method with open parenthesis and we have seen most of these before:
data.to_csv()

We can see the first input argument is the file path or buffer. We can specify the file name in the form of a string and this new file will be generated in the same folder as the Python script file. We will add a new series to the dataframe and then resave it as a csv file, a tab delimited text file and an Excel file. For the tab delimited text file, we need to supply the separator (otherwise the default value, the comma, will be used):
import numpy as np
import pandas as pd

data = pd.read_excel("Book3.xlsx", sheet_name="Book2",
                     parse_dates=["datetime"])
data["new"] = np.arange(6)
data.to_csv("Book4.csv")
data.to_csv("Book5.txt", sep="\t")
data.to_excel("Book6.xlsx")
These files display in the files tab and can be opened in Notepad++:



Note an index is added. This can be removed by assigning index to False:
import numpy as np
import pandas as pd

data = pd.read_excel("Book3.xlsx", sheet_name="Book2",
                     parse_dates=["datetime"])
data["new"] = np.arange(6)
data.to_csv("Book4.csv", index=False)
data.to_csv("Book5.txt", sep="\t", index=False)
data.to_excel("Book6.xlsx", index=False)
This gives the same type of data as we read in earlier:


Writing dataframes to Sheets within an Excel File
Using the procedure above with different sheet names will simply overwrite the file each time as opposed to appending a sheet to an existing file.
In order to do this we must create an instance of the ExcelWriter class (and as this is another class it is CamelCaseCapitalized). This class has an input argument path and we specify the file name in the form of a string.
Now we once again use the to_excel method for each dataframe but specify the writer as opposed to a separate file, and must also specify a sheet_name (otherwise we will overwrite the same sheet multiple times).
Finally when we are done we call the method save from the writer.
import numpy as np
import numpy.random as random
import pandas as pd

#%% generate Data
data = pd.DataFrame({"zeros": np.zeros((10,))})
data2 = pd.DataFrame({"ones": np.ones((10,))})
data3 = pd.DataFrame({"randint": random.randint(0, 10, (10,))})

#%% Write Sheets
writer = pd.ExcelWriter(path="Book7.xlsx")
data.to_excel(writer, sheet_name="0_10")
data2.to_excel(writer, sheet_name="1_10")
data3.to_excel(writer, sheet_name="r_10")
writer.save()
Filtering and Sorting
Boolean Values
Let's recreate our simple dataframe using:
import numpy as np
import pandas as pd

data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [1, 3, 5, 7, 6, 9],
             "floating": [3.14, 6.28, 9.37, 10.28, 11.69, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}
data = pd.DataFrame(data=data_dict)
Now notice that the series boolean is a series of bool values that are either True or False.
We can assign this series to a variable name directly using:
boolean = data.boolean
Now we can index into the dataframe using this series:
data[boolean]
This will filter using the bool array and in this case only observations at 0, 2 and 4 have a True value:
Our new output dataframe is therefore a subset of our original dataframe displaying only these observations:

Instead of a series, we can index using a list of bool values or a numpy array of bool values that are equal in size to the number of observations in the dataframe:
bool_list = [True, False, True, False, True, False]
data[bool_list]

bool_array = np.array([False, True, False, True, False, True])
data[bool_array]

Instead of creating a bool series, bool list or bool ndarray as a separate object in the variable explorer, we can directly index into a dataframe using a series within the dataframe. We can index using a bool series as an attribute:
data[data.boolean]
Or via the use of square brackets:
data[data["boolean"]]

In the outer brackets we are indexing a series into data. In the inner brackets we are selecting a series from data.
Instead of creating a series of boolean values directly as a separate series within the dataframe, we can get a series of bool values by comparing a series to a value, or to another series, using conditional operators.
conditional operator | meaning |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
== | equal to |
!= | not equal to |
For example:
data.integer > 5
We can also use:
adjoining operator | meaning |
& | and |
| | or |
For example we can look at the condition when the integer in the integer series is greater than 5 and when it is even using:
(data.integer > 5) & (data.integer % 2 == 0)
And the example when the integer series is greater than 5 or when it is not even using:
(data.integer > 5) | (data.integer % 2 != 0)

To index using these conditions we need to place them within the square brackets:
data[(data.integer > 5) & (data.integer % 2 == 0)]

Recall that we can use integer location iloc to index into an observation instead of a series. If we set a colon : for all observations, we can select the series we want using a bool list or bool ndarray. For example if we only wanted the first two series boolean and integer we could use:
data.iloc[:, [True, True, False, False]]

Handling Missing Data
So far we have only imported data from Excel which has all values present. In real life some of the data will be missing (or of the wrong data type) or calculated by a formula that returns an error such as division by zero:
To read in this file we would have to use:
import numpy as np
import pandas as pd

data = pd.read_excel("Book8.xlsx", sheet_name="Sheet1",
                     header=None, names=["x", "y"],
                     skiprows=2, skipfooter=2)

Our dataframe displays not a number nan for each missing value or error:
We can use the dataframe methods isna and notna to output a boolean dataframe. In the case of isna every value will be True when the value displays as nan and False when the value is a number:
data.isna()
Then as this output is a dataframe we can use the dataframe method any to check whether any value in an observation is True.
We need to specify axis=1 (axis="columns") to indicate we want to check along the column axis.
This gives us a pandas bool series:
data.isna().any(axis="columns")

We can finally index into data to make a dataframe of only the observations we need to pay more attention to:
data[data.isna().any(axis="columns")]
Inversely we can use notna, where every numeric value will be True and every nan value False:
data.notna()
And we can use the dataframe method all to only select observations where everything within an observation is True. We need to once again specify axis=1 (axis="columns") to indicate we want to work along the column axis. This will once again output a bool series:
data.notna().all(axis="columns")
Finally we can index into the dataframe using:
data[data.notna().all(axis="columns")]
We can also carry out the above more directly using the method dropna:
data.dropna()
When using dropna we are working along the index axis to look for any observation that contains nan values and drop the observation accordingly. The default keyword arguments are therefore used:
data.dropna(axis="index", how="any")
We can also optionally only drop observations where all values are missing using:
data.dropna(axis="index", how="all")
If the data is ordered, it can roughly be filled in using a backward fill (filling each gap with the next valid value) or a forward fill (filling each gap with the previous valid value):
data.fillna(method="bfill", axis="index")

data.fillna(method="ffill", axis="index")

It can also be more accurately interpolated using the interpolate method with the methods "nearest", "linear", "quadratic" or "cubic":
data.interpolate(method="nearest", axis="index")
data.interpolate(method="linear", axis="index")
data.interpolate(method="quadratic", axis="index")
data.interpolate(method="cubic", axis="index")

Sorting Data
Let's create a dataframe of unsorted values:
import numpy as np
import pandas as pd

data_dict = {"boolean": [True, False, True, False, True, False],
             "integer": [3, 1, 1, 1, 9, 6],
             "floating": [12.14, 5.82, 9.37, 2.56, 3.08, 12.24],
             "string": ["hello", "world", "it", "is", "raining", "today"]}
data = pd.DataFrame(data=data_dict)
Now let's have a look at sorting the data using the dataframe method sort_values. We have the input argument by which can be assigned to a list of column names, and axis which can be assigned to axis=0 (axis="index") to indicate that we want to sort data along the index axis.
We can sort data numerically along the index for an int, float or bool series. Let's for example sort data along the index using the data in column "integer":
data.sort_values(by=["integer"], axis="index")

We can perform a secondary sort by providing a second column in the list, in this case "floating":
data.sort_values(by=["integer", "floating"], axis="index")
We can also sort data alphabetically in a string series, for example the series string, as shown below:
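A quick sketch, following the same pattern as the numeric sort above:

data.sort_values(by=["string"], axis="index")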
In Python each character in a string corresponds to an int value. These can be looked up by using the ord function. The character an int corresponds to can inversely be looked up using the chr function. For example:
ord("A") chr(65)
We can construct a basic for loop to see the sequence of characters. We see that the sequence contains numeric characters, then upper case and then lower case characters, as well as the order of the other keyboard symbols.
Note the earlier integers correspond to non-printable characters, for example new line and tab.
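A minimal sketch of such a loop over the printable ASCII range:

# 32 is the space character and 126 is ~
for i in range(32, 127):
    print(i, chr(i))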

Categorical Data
Let's create a dataframe emulating a user selecting and throwing a die from a box of randomly coloured dice. The die has 6 numbers and 3 colours are available. In the code I will begrudgingly use the US spelling of colour, "color", as Python libraries generally use US spellings and usage of UK spellings can sometimes result in confusion:
import numpy as np
import pandas as pd
from numpy import random

random.seed(0)
data = pd.DataFrame({"num": random.choice((1, 2, 3, 4, 5, 6), size=12),
                     "color": random.choice(("red", "green", "blue"), size=12)})
If we have a look at the datatypes:
data.dtypes

We can see the dtype of the series color is object which essentially means the datatype is recognised as strings. We can access this series as an attribute of the dataframe and use the method astype to cast this series into another datatype:
data.color.astype("category")

This method can also be used to cast data into other datatypes, for example the integer series can be cast into a float series:
data.num.astype("float")

To make the changes in place we need to reassign the output to the series within the dataframe:
data.color = data.color.astype("category")
The dataframe does not appear to change in the variable explorer but we can now use the series attribute cat to access categorical methods.
Unfortunately the autocompletion in the Spyder script editor is lacking when it comes to accessing methods from a nested attribute and no code completion options display:
data.color.cat.

The Console Code completion however fares slightly better. Note we need to press ↹ for the code completion to display here:

We can view the categories as an index using:
data.color.cat.categories

We can index into an existing value within the color series for example at index 1 and reassign it to another category using:
data.color[1] = "green"
Note that this doesn't work if we attempt to use a category that doesn't exist, for example "yellow", giving a ValueError:
data.color[1] = "yellow"
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
We need to first add the category using the method add_categories:
data.color = data.color.cat.add_categories("yellow")
data.color[1] = "yellow"
Let's attempt to sort values using our category:
data.sort_values(by=["color"], axis="index")
By default the categories are ordered alphabetically, so the values sort in that order.

We can use the method reorder_categories to reorder the categories. Supposing we want the score for a color dice to be lowest for green, lower medium for blue, upper medium for yellow and highest for red we can use:
data.color = data.color.cat.reorder_categories(["green", "blue", "yellow", "red"])
And now if we want to sort the score from highest to lowest we can use:
data = data.sort_values(["num", "color"], axis="index", ascending=False)
We can rename the categories using one letter abbreviations by using a dictionary where the key is the old category name and the value is the new category name:
data.color = data.color.cat.rename_categories({"red": "r", "green": "g", "blue": "b", "yellow": "y"})
If we want to merge all "b" and "g" categories into "g" we can use logical indexing to select all the "b" categories:
data.color[data.color=="b"]

We can then assign this selection to "g" and scalar expansion will be used to apply this string to all values:
data.color[data.color=="b"] = "g"
Now when we use:
data.color.cat.categories

We see we have an unused category "b". We can remove it using the method remove_categories and specifying "b" as an input argument or by using the method remove_unused_categories which will automatically see that "b" is no longer used and remove it:
data.color.cat.remove_categories("b")
data.color.cat.remove_unused_categories()


Once again to apply the changes, we need to reassign the output to the series within the dataframe:
data.color = data.color.cat.remove_unused_categories()
Now that we have categorical data we can use the dataframe method groupby. The groupby method has the input arguments by which we can assign to a series name and axis which we can assign to "index" to indicate that we are going to groupby the "color" category working along the index axis:
gbo = data.groupby(by="color", axis="index")

From the groupby object gbo:
gbo.
We have a number of statistical methods such as count, sum, max, min, mean, median, var and std which we have seen previously used on numpy arrays.

For example we can groupby "color" and count the number of values for each color:
gbcount = data.groupby(by="color", axis="index").count()

The other statistical methods mentioned above have a similar form when used directly from the groupby object as shown above.
If however we select another series as an attribute from the groupby object and then use a statistical method on the series, we will instead perform the operation on that series and output a series. For example if we wanted the sum of num for each color we could use:
gbsumnum = data.groupby(by="color", axis="index").num.sum()

So far we have manually created categories; however we can also create categories from numeric data using the cut function. x is the numeric series we wish to cut into categories. bins is a list of numeric values. If we want 3 categories we must specify 3+1 bin edges so each category has a lower and upper bound:

For example we can use our num series and set the bins=[0, 2, 4, 6]
pd.cut(data.num, bins=[0, 2, 4, 6])

Notice the categories are labelled (0, 2] < (2, 4] < (4, 6]. The ( indicates the category is exclusive of the lower bound and the ] indicates the category is inclusive of the upper bound.
For example at index 1, num is 6 and is present in (4, 6] and at index 4, the value is 4 and is present in (2, 4]. Instead of using these labels we can specify our own using the input argument labels, assigning it to a list of strings equal in length to the number of categories. For example in our case we can create the categories "low", "med" and "high":
pd.cut(data.num, bins=[0, 2, 4, 6], labels=["low", "med", "high"])
We see that these categories are ordered as expected:

We can assign the output to a new series in the dataframe by using square brackets to enclose the new series name:
data["grade"] = pd.cut(data.num, bins=[0, 2, 4, 6], labels=["low", "med", "high"])
Now that our dataframe has two categorical series, color and grade, we can group by a list of both of these categories:
gbcount = data.groupby(by=["color", "grade"], axis="index").count()
Note that this dataframe has a multi-level index. The groupby method can select the level of the multi-level index to once again create a groupby object that we can perform a statistical operation on, for example:
gbcount.groupby(level=0).count()
gbcount.groupby(level=1).count()

Date and Time
When dealing with time and date in Python we have either a datetime object which can be thought of as a time stamp (date in a calendar) or a timedelta object which can be thought of as a time difference (time difference from a starting time).
Python has an inbuilt module datetime that can be used to create these objects. numpy also has datetime64 and timedelta64 objects. These have slightly different syntax.
In both cases the year, month, day and then time are input for a datetime object as a series of integer input arguments for the datetime class and a datetime string for the datetime64 class.
For a timedelta object, the day, second and microsecond units are preferred. Units such as the year and the month in particular are not commonly used as they are ambiguous i.e. not constant. The timedelta once again uses a series of integer input arguments while the timedelta64 uses an integer and unit pair, and multiple timedelta64 objects can be summed:
import datetime
import numpy as np
import pandas as pd

date1 = datetime.datetime(year=2021, month=1, day=16, hour=17, minute=1, second=0)
timediff1 = datetime.timedelta(days=1, seconds=1, microseconds=1)
date2 = np.datetime64("2021-01-16 17:01:00")
timediff2 = np.timedelta64(1, "D") + np.timedelta64(1, "s") + np.timedelta64(1, "us")
Spyder unfortunately only supports the inbuilt datetime objects on the variable explorer and the numpy objects do not display.

On the other hand the datetime64 and timedelta64 from numpy have several advantages and it is relatively easy to use the numpy function arange to create a datetime64 series using a datetime64 start and stop and a timedelta64 step, or a timedelta64 series using a timedelta64 start, stop and step. For example:
import datetime
import numpy as np
import pandas as pd

dates = np.arange(start=np.datetime64("2021-01-16 17:01:00"),
                  stop=np.datetime64("2021-02-01 17:01:00"),
                  step=np.timedelta64(1, "D"))
times = np.arange(start=np.timedelta64(0, "s"),
                  stop=np.timedelta64(101, "s"),
                  step=np.timedelta64(1, "s"))

Although Spyder has no support for these objects in the variable explorer, we can convert them to a pandas series which Spyder does support on the variable explorer. Recall that a Series has data in the form of values, an index (usually of integer observations) and a series name. We can provide the data in the form of the numpy array alongside the name in the form of a string. If we don't specify the index, a default index will be created:
import datetime
import numpy as np
import pandas as pd

dates = np.arange(start=np.datetime64("2021-01-16 17:01:00"),
                  stop=np.datetime64("2021-02-01 17:01:00"),
                  step=np.timedelta64(1, "D"))
times = np.arange(start=np.timedelta64(0, "s"),
                  stop=np.timedelta64(101, "s"),
                  step=np.timedelta64(1, "s"))
dates = pd.Series(data=dates, name="dates")
times = pd.Series(data=times, name="times")

These series could then be added to dataframes of the correct size and also be used for sorting a dataframe as we have seen before with the other datatypes.
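As a final minimal sketch (the column name "reading" is purely illustrative), we could build a dataframe around the dates series and then sort by it:

df = pd.DataFrame({"dates": dates,
                   "reading": np.random.rand(len(dates))})
df = df.sort_values(by=["dates"], axis="index", ascending=False)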