Python: Using Pandas

Loading Pandas

In order to use Pandas we must import it into our Python script alongside its perquisite Numpy. As both these are some commonly used, they are often imported as 2 letter names, so they can easily be referenced:

Python

Pandas DataFrame Basics

DataFrames have the basic form shown below. First of all there is the data which consists of a number of Indexes (Rows) of equal size. These Indexes (Rows) may be assigned a name (defaulting to numeric values) and each element in a Row will correspond to a Column which may also be assigned a name (also defaulting to numeric values). Pandas follows the convention behind Python of 0 order indexing, meaning the starting Row and Column is 0,0 respectively.

Creating a Data Frame

To create a DataFrame like above we must use:

Python

For convenience we can type each Row on a Separate Line. This creates the DataFrame which shows up in the Variable Explorer as expected:

The following code, could also be used.

Python

Note when labelling columns or indexes, it is advisable to avoid using spaces, use an underscore instead and also avoid starting the column or index name with a number:

Python

Having spaces or starting the column, index with a number will create issues when using dot indexing.

Looking at a DataFrames Information

Information about DataFrames can be obtained by using dot indexing. We have already created the variable mydataframe. If we type into the console the DataFrame name followed by a .

Python

Then press the [tab] key, we will be presented with a range of options. To the top we will see all the Column Names:

If instead of the columns used above, we use:

Python

Then we will see that the columns with the spaces and starting with a number do not show on the list because they cannot be dot indexed:

Going back to the original dataset, we can scroll through the list of other functions we can perform on the DataFrame:

When dot indexing is used there are no quotations around the text field being indexed and any input arguments are enclosed in ( )

As DataFrames can be quite large there are two functions, one called head and the other called tail which by default view the top 5 or bottom 5 Rows of a DataFrame, the number of Indexes (Rows) shown can be adjusted using the input argument, in this case it is left blank so defaults to 5:

Python
    C0  C1  C2  C3  C4  C5  C6
R0   1   2   3   4   5   6   7
R1   8   9  10  11  12  13  14
R2  15  16  17  18  19  20  21
R3  22  23  24  25  26  27  28
R4  29  30  31  32  33  34  35
Python
    C0  C1  C2  C3  C4  C5  C6
R1   8   9  10  11  12  13  14
R2  15  16  17  18  19  20  21
R3  22  23  24  25  26  27  28
R4  29  30  31  32  33  34  35
R5  36  37  38  39  40  41  42

In the case of this simple data set, there are only 6 rows so we print all the rows except the last one or zeroth one respectively.

We can also look at the datatyes abbreviated as dtypes. In this case all columns consist of exclusively numeric values so are int64:

Python
C0    int64
C1    int64
C2    int64
C3    int64
C4    int64
C5    int64
C6    int64
dtype: object

We can look up the index:

Python
Index(['R0', 'R1', 'R2', 'R3', 'R4', 'R5'], dtype='object')

We can look up the columns:

Python
Index(['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6'], dtype='object')

We can look up all the values:

Python
[[ 1  2  3  4  5  6  7]
 [ 8  9 10 11 12 13 14]
 [15 16 17 18 19 20 21]
 [22 23 24 25 26 27 28]
 [29 30 31 32 33 34 35]
 [36 37 38 39 40 41 42]]

Or we can look up the entire table using describe:

Python
<bound method NDFrame.describe of     C0  C1  C2  C3  C4  C5  C6
R0   1   2   3   4   5   6   7
R1   8   9  10  11  12  13  14
R2  15  16  17  18  19  20  21
R3  22  23  24  25  26  27  28
R4  29  30  31  32  33  34  35
R5  36  37  38  39  40  41  42>

Indexing

Since one is typically working with Columns in DataFrames, there are a number of ways to select a Column. For example using Dot Indexing, recall you can type in your DataFrame name and then dot . and then press tab to return a list of possible values:

In this case we are select C1 within mydataframe

Python
R0     2
R1     9
R2    16
R3    23
R4    30
R5    37
Name: C1, dtype: int64

Using this notation one may also type in the Column name followed by dot . and press [tab] again. In this case this gives a list of Rows that can be selected within C1:

In this case we are selecting index (row2) R2 within column C1 within mydataframe:

Python
16

This returns a scalar.

Note that the procedure can be continued and we can carry out a function on this scalar should we wish to do so. For example the Transpose, which is at the top. This would mean we carry out the Transpose of the 2nd Row of the first Column of mydataframe. And of course the transpose of a scalar would return the original scalar.

We can go further yet again and type in dot and then [tab] once again. Now we can carry out another function on the Transpose of the 2nd Row of the first Column of mydataframe. This gives a demonstration into the elegance of dot indexing:

On the hand one may select a Column by using an array of the Column Name:

Python
R0     2
R1     9
R2    16
R3    23
R4    30
R5    37
Name: C1, dtype: int64

Multiple Columns may be picked up using:

Python

Note the double, square bracket, the outside set is required to select an input argument into the dataframe in this case, mydataframe and the inside set close around the input Row Vector.

    C1  C2
R0   2   3
R1   9  10
R2  16  17
R3  23  24
R4  30  31
R5  37  38

A single Index can be selected by using loc and then selecting the Index (Row) and then Column:

Python
10

To select all values in a Row one may use the argument : for the Col:

Python
C0     8
C1     9
C2    10
C3    11
C4    12
C5    13
C6    14
Name: R1, dtype: int64

In addition to using loc and indexing by the numerical value of the index (row) and column. For R1 and C2, this is 2 and 3 respectively:

Python
10

Multiple values may be selected using:

Python
    C2  C3
R0   3   4
R1  10  11
R2  17  18

For the Index (Rows) the input value 0:3 means the 0th Index (Row) up to but not including the 3rd Index (Row) and likewise for the Columns selection the value 2:4 means the 2nd Column up to, but not including the 4th Column.

Replacing A Single Value

Supposing we wanted to update Index (Row) 0 and Column 1, we would need to select it to the left hand side and assign it to a new value on the right hand side:

Python

We could then update the Index (Row) and Column Names respectively:

Python

Here R0 is the old name and R0A is the new name.

Python

Likewise C1 is the old name and C1A is the new name.

Sort Values

In the case of our data, it was already sorted, lowest to highest across both rows and across columns however the value in R0A,C1A is now out of place. We can reorder the data using:

Python
Python

Sub-Selection by Conditional Logic

One can select a sub-selection of the data using conditional logic (use of less-than-or equal to <=, less-than <, equal ==. not equal~=, greater-than >, greater-than or equal to >=). In this case, we can specify less than 99.

Python

Transposing a DataFrame

It is possible to Transpose a DataFrame by using .T:

Python

This may make it easier to work with what were formerly the Indexes.

Of course, transposing the transpose, returns the original dataframe:

Python
Advertisements

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.