In order to use Pandas we must import it into our Python script alongside its perquisite Numpy. As both these are some commonly used, they are often imported as 2 letter names, so they can easily be referenced:
Pandas DataFrame Basics
DataFrames have the basic form shown below. First of all there is the data which consists of a number of Indexes (Rows) of equal size. These Indexes (Rows) may be assigned a name (defaulting to numeric values) and each element in a Row will correspond to a Column which may also be assigned a name (also defaulting to numeric values). Pandas follows the convention behind Python of 0 order indexing, meaning the starting Row and Column is 0,0 respectively.
Creating a Data Frame
To create a DataFrame like above we must use:
For convenience we can type each Row on a Separate Line. This creates the DataFrame which shows up in the Variable Explorer as expected:
The following code, could also be used.
Note when labelling columns or indexes, it is advisable to avoid using spaces, use an underscore instead and also avoid starting the column or index name with a number:
Having spaces or starting the column, index with a number will create issues when using dot indexing.
Looking at a DataFrames Information
Information about DataFrames can be obtained by using dot indexing. We have already created the variable mydataframe. If we type into the console the DataFrame name followed by a .
Then press the [tab] key, we will be presented with a range of options. To the top we will see all the Column Names:
If instead of the columns used above, we use:
Then we will see that the columns with the spaces and starting with a number do not show on the list because they cannot be dot indexed:
Going back to the original dataset, we can scroll through the list of other functions we can perform on the DataFrame:
When dot indexing is used there are no quotations around the text field being indexed and any input arguments are enclosed in ( )
As DataFrames can be quite large there are two functions, one called head and the other called tail which by default view the top 5 or bottom 5 Rows of a DataFrame, the number of Indexes (Rows) shown can be adjusted using the input argument, in this case it is left blank so defaults to 5:
C0 C1 C2 C3 C4 C5 C6 R0 1 2 3 4 5 6 7 R1 8 9 10 11 12 13 14 R2 15 16 17 18 19 20 21 R3 22 23 24 25 26 27 28 R4 29 30 31 32 33 34 35
C0 C1 C2 C3 C4 C5 C6 R1 8 9 10 11 12 13 14 R2 15 16 17 18 19 20 21 R3 22 23 24 25 26 27 28 R4 29 30 31 32 33 34 35 R5 36 37 38 39 40 41 42
In the case of this simple data set, there are only 6 rows so we print all the rows except the last one or zeroth one respectively.
We can also look at the datatyes abbreviated as dtypes. In this case all columns consist of exclusively numeric values so are int64:
C0 int64 C1 int64 C2 int64 C3 int64 C4 int64 C5 int64 C6 int64 dtype: object
We can look up the index:
Index(['R0', 'R1', 'R2', 'R3', 'R4', 'R5'], dtype='object')
We can look up the columns:
Index(['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6'], dtype='object')
We can look up all the values:
[[ 1 2 3 4 5 6 7] [ 8 9 10 11 12 13 14] [15 16 17 18 19 20 21] [22 23 24 25 26 27 28] [29 30 31 32 33 34 35] [36 37 38 39 40 41 42]]
Or we can look up the entire table using describe:
<bound method NDFrame.describe of C0 C1 C2 C3 C4 C5 C6 R0 1 2 3 4 5 6 7 R1 8 9 10 11 12 13 14 R2 15 16 17 18 19 20 21 R3 22 23 24 25 26 27 28 R4 29 30 31 32 33 34 35 R5 36 37 38 39 40 41 42>
Since one is typically working with Columns in DataFrames, there are a number of ways to select a Column. For example using Dot Indexing, recall you can type in your DataFrame name and then dot . and then press tab to return a list of possible values:
In this case we are select C1 within mydataframe
R0 2 R1 9 R2 16 R3 23 R4 30 R5 37 Name: C1, dtype: int64
Using this notation one may also type in the Column name followed by dot . and press [tab] again. In this case this gives a list of Rows that can be selected within C1:
In this case we are selecting index (row2) R2 within column C1 within mydataframe:
This returns a scalar.
Note that the procedure can be continued and we can carry out a function on this scalar should we wish to do so. For example the Transpose, which is at the top. This would mean we carry out the Transpose of the 2nd Row of the first Column of mydataframe. And of course the transpose of a scalar would return the original scalar.
We can go further yet again and type in dot and then [tab] once again. Now we can carry out another function on the Transpose of the 2nd Row of the first Column of mydataframe. This gives a demonstration into the elegance of dot indexing:
On the hand one may select a Column by using an array of the Column Name:
R0 2 R1 9 R2 16 R3 23 R4 30 R5 37 Name: C1, dtype: int64
Multiple Columns may be picked up using:
Note the double, square bracket, the outside set is required to select an input argument into the dataframe in this case, mydataframe and the inside set close around the input Row Vector.
C1 C2 R0 2 3 R1 9 10 R2 16 17 R3 23 24 R4 30 31 R5 37 38
A single Index can be selected by using loc and then selecting the Index (Row) and then Column:
To select all values in a Row one may use the argument : for the Col:
C0 8 C1 9 C2 10 C3 11 C4 12 C5 13 C6 14 Name: R1, dtype: int64
In addition to using loc and indexing by the numerical value of the index (row) and column. For R1 and C2, this is 2 and 3 respectively:
Multiple values may be selected using:
C2 C3 R0 3 4 R1 10 11 R2 17 18
For the Index (Rows) the input value 0:3 means the 0th Index (Row) up to but not including the 3rd Index (Row) and likewise for the Columns selection the value 2:4 means the 2nd Column up to, but not including the 4th Column.
Replacing A Single Value
Supposing we wanted to update Index (Row) 0 and Column 1, we would need to select it to the left hand side and assign it to a new value on the right hand side:
We could then update the Index (Row) and Column Names respectively:
Here R0 is the old name and R0A is the new name.
Likewise C1 is the old name and C1A is the new name.
In the case of our data, it was already sorted, lowest to highest across both rows and across columns however the value in R0A,C1A is now out of place. We can reorder the data using:
Sub-Selection by Conditional Logic
One can select a sub-selection of the data using conditional logic (use of less-than-or equal to <=, less-than <, equal ==. not equal~=, greater-than >, greater-than or equal to >=). In this case, we can specify less than 99.
Transposing a DataFrame
It is possible to Transpose a DataFrame by using .T:
This may make it easier to work with what were formerly the Indexes.
Of course, transposing the transpose, returns the original dataframe: