Python Pandas: Categorical Data

Data

We can import our data from an Excel spreadsheet:

Python

Or we can create this data from scratch using:

Python
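A minimal sketch of building the data by hand, using the names and scores from the printed output in this post. The variable name student_scores is taken from the code later in the post; the exact construction the original used is an assumption.

```python
import pandas as pd

# Names and letter scores copied from the column printouts in this post.
student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": ["A", "C", "B", "D", "F", "F", "F", "C", "D", "B"],
})

# Each column can be read with bracket or dot notation.
print(student_scores["Student"])
print(student_scores.Score)
```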

We can then look at each column:

Python
0    Philip
1      Jess
2      Maya
3    Jordan
4      John
5    Roisin
6     Petra
7      Lisa
8     Simon
9     Peter
Name: Student, dtype: object
Python
0    A
1    C
2    B
3    D
4    F
5    F
6    F
7    C
8    D
9    B
Name: Score, dtype: object

We see that both columns have the dtype object.

Setting a Column as Category

A new categorical column can be created from an existing column by calling astype and setting the type to category.

Python
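A short sketch of this step, assuming the student_scores DataFrame described above. The new column name Score_as_Category follows the spelling used in the warning example in this post.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": ["A", "C", "B", "D", "F", "F", "F", "C", "D", "B"],
})

# Bracket notation is needed because the column does not exist yet.
student_scores["Score_as_Category"] = student_scores["Score"].astype("category")

print(student_scores["Score_as_Category"])
```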

Note that dot notation cannot be used to create a new column. The following code raises a UserWarning and does not add a column to the DataFrame:

Python
student_scores.Score_as_Category=student_scores.Score
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

Alternatively, the original column can be replaced by assigning the output back to it:

Python
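A sketch of the in-place replacement, again assuming the student_scores DataFrame from earlier. Dot notation is used on both sides because the Score column already exists.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": ["A", "C", "B", "D", "F", "F", "F", "C", "D", "B"],
})

# Overwrite the existing column with its categorical version.
student_scores.Score = student_scores.Score.astype("category")

# dtypes lists the datatype of every column.
print(student_scores.dtypes)
```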

In this case dot notation does work, because no new column is created.

Looking at the Datatype

The new column can be read out:

Python
0    A
1    C
2    B
3    D
4    F
5    F
6    F
7    C
8    D
9    B
Name: Score as Category, dtype: category
Categories (5, object): [A, B, C, D, F]

We can also look at the datatype of all columns using dtypes:

Python
Student                object
Score                category
Score_As_Category    category
dtype: object

Cuts – Integer Bins

For the next section we want to start with a slightly different dataset, one that is numeric. Once again we can import it from Excel:

Python

Or create it in Python manually:

Python
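The original numeric values are not shown in this post, so the sketch below uses illustrative scores. They are an assumption, chosen only so that each student falls into the same bins and grades as in the output printed later; the column name Score is also assumed.

```python
import pandas as pd

# Illustrative values (assumed, not the post's original data).
# Minimum 35 and maximum 80 match the bin edges shown in the cut output.
student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": [80, 65, 68, 55, 40, 35, 42, 70, 58, 63],
})

print(student_scores)
```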

We can cut this data into 5 equal-width slices using:

Python
Python
0      (71.0, 80.0]
1      (62.0, 71.0]
2      (62.0, 71.0]
3      (53.0, 62.0]
4    (34.955, 44.0]
5    (34.955, 44.0]
6    (34.955, 44.0]
7      (62.0, 71.0]
8      (53.0, 62.0]
9      (62.0, 71.0]
Name: Slices, dtype: category
Categories (5, interval[float64]): [(34.955, 44.0] < (44.0, 53.0] < (53.0, 62.0] < (62.0, 71.0] < (71.0, 80.0]]
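
Output of this shape can be produced with pd.cut and an integer number of bins. The sketch below assumes illustrative scores (not the post's original data) with a minimum of 35 and a maximum of 80, which gives the same bin edges as above.

```python
import pandas as pd

# Illustrative values (assumed); only the min and max determine the edges.
student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": [80, 65, 68, 55, 40, 35, 42, 70, 58, 63],
})

# An integer bins argument splits the range of the data into
# that many equal-width intervals, each closed on the right.
student_scores["Slices"] = pd.cut(student_scores["Score"], bins=5)

print(student_scores["Slices"])
```

The lowest edge is pushed slightly below the minimum so that the smallest value is still included in the first right-closed interval.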

Selected Cuts

Alternatively, we can specify our own bin edges, for instance using the lower and upper bounds of each Grade:

Python
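A sketch of cutting with explicit bin edges. The edges 0, 50, 60, 70, 80 and 100 come from the IntervalIndex printed later in this post, and the column name Selected_Slices from the surrounding text; the scores themselves are illustrative assumptions.

```python
import pandas as pd

# Illustrative values (assumed, not the post's original data).
student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": [80, 65, 68, 55, 40, 35, 42, 70, 58, 63],
})

# A list of edges defines the intervals (0, 50], (50, 60], ... (80, 100].
grade_edges = [0, 50, 60, 70, 80, 100]
student_scores["Selected_Slices"] = pd.cut(student_scores["Score"], bins=grade_edges)

print(student_scores["Selected_Slices"])
```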

This gives the following:

Deleting a Column or Row

We no longer need the Slices column, so we can delete it by calling the drop method on the DataFrame:

Python

Here the axis is set to 1 to drop a column; it would be set to 0 to drop a row by its index label.
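
A sketch of both cases, assuming a DataFrame with a Slices column built from illustrative scores. Whether the original assigned the result back or used inplace=True is not shown, so the reassignment here is an assumption.

```python
import pandas as pd

# Illustrative values (assumed, not the post's original data).
student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": [80, 65, 68, 55, 40, 35, 42, 70, 58, 63],
})
student_scores["Slices"] = pd.cut(student_scores["Score"], bins=5)

# axis=1 drops a column; drop returns a new DataFrame.
student_scores = student_scores.drop("Slices", axis=1)

# axis=0 drops a row, identified by its index label.
without_first_row = student_scores.drop(0, axis=0)

print(student_scores.columns.tolist())
```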

Renaming Columns or Rows

We can see that the Categories are named by the lower and upper bound specified when the Selected_Slices were made from the cut:

The Selected_Slices column can be renamed Grades:

Python
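One way to rename a column is the rename method with a columns mapping, sketched here on illustrative scores. The original may have used a different call.

```python
import pandas as pd

# Illustrative values (assumed, not the post's original data).
student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": [80, 65, 68, 55, 40, 35, 42, 70, 58, 63],
})
student_scores["Selected_Slices"] = pd.cut(
    student_scores["Score"], bins=[0, 50, 60, 70, 80, 100])

# Map old column name to new column name.
student_scores = student_scores.rename(columns={"Selected_Slices": "Grades"})

print(student_scores.columns.tolist())
```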

Each category can then be renamed to the letter grade it corresponds to.

Let's look at the categories using:

Python
IntervalIndex([(0, 50], (50, 60], (60, 70], (70, 80], (80, 100]],
              closed='right',
              dtype='interval[int64]')

We can rename these categories using:

Python
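A sketch of inspecting and renaming the categories through the .cat accessor. The scores are illustrative assumptions; the letter order F, D, C, B, A follows the category order printed later in this post.

```python
import pandas as pd

# Illustrative values (assumed, not the post's original data).
student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Score": [80, 65, 68, 55, 40, 35, 42, 70, 58, 63],
})
student_scores["Grades"] = pd.cut(
    student_scores["Score"], bins=[0, 50, 60, 70, 80, 100])

# The categories are the intervals, from lowest to highest.
print(student_scores["Grades"].cat.categories)

# New names are matched to the existing categories by position.
student_scores["Grades"] = student_scores["Grades"].cat.rename_categories(
    ["F", "D", "C", "B", "A"])

print(student_scores["Grades"])
```

Because pd.cut produces an ordered categorical, the renamed grades keep the order F < D < C < B < A.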

Selection by Category

Suppose we wanted to create a new DataFrame consisting of all Students with a Grade of C. We could use:

Python
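A sketch of this selection with a boolean mask. As a shortcut, the Grades column is built directly with pd.Categorical from the grades printed in this post rather than via the cut; the name grade_c is hypothetical.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Grades": pd.Categorical(
        ["B", "C", "C", "D", "F", "F", "F", "C", "D", "C"],
        categories=["F", "D", "C", "B", "A"], ordered=True),
})

# Keep only the rows whose grade equals "C".
grade_c = student_scores[student_scores["Grades"] == "C"]

print(grade_c)
```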

Alternatively, because our categories are ordered as A > B > C > D > F, we can use > 'F' to obtain the students that passed:

Python
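A sketch of the ordered comparison, with the Grades column again built directly with pd.Categorical as a shortcut. The comparison only works because the categorical is ordered; the name passed is hypothetical.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Grades": pd.Categorical(
        ["B", "C", "C", "D", "F", "F", "F", "C", "D", "C"],
        categories=["F", "D", "C", "B", "A"], ordered=True),
})

# "> 'F'" uses the category order F < D < C < B < A, not alphabetical order.
passed = student_scores[student_scores["Grades"] > "F"]

print(passed)
```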

Merging Categories

Suppose we want to merge the categories A, B, C and D as pass, and F as fail, in a new column. To do this we first copy the Grades column and save it as a new column, Status:

Python
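A sketch of the copy, with Grades built directly with pd.Categorical as a shortcut. Using .copy() here is an assumption; the original may have used a plain assignment.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Grades": pd.Categorical(
        ["B", "C", "C", "D", "F", "F", "F", "C", "D", "C"],
        categories=["F", "D", "C", "B", "A"], ordered=True),
})

# The new column keeps the categorical dtype and its ordering.
student_scores["Status"] = student_scores["Grades"].copy()

print(student_scores["Status"])
```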

Next we will need to create some new categories:

Python
Python
0    B
1    C
2    C
3    D
4    F
5    F
6    F
7    C
8    D
9    C
Name: Status, dtype: category
Categories (7, object): [F < D < C < B < A < fail < pass]
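
Output like this can be obtained with add_categories, sketched below on a Status column built directly with pd.Categorical as a shortcut. New categories are appended after the existing ones.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Status": pd.Categorical(
        ["B", "C", "C", "D", "F", "F", "F", "C", "D", "C"],
        categories=["F", "D", "C", "B", "A"], ordered=True),
})

# A value can only be assigned later if it is already a category.
student_scores["Status"] = student_scores["Status"].cat.add_categories(
    ["fail", "pass"])

print(student_scores["Status"])
```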

To reorder these, we can use:

Python
Python
0    B
1    C
2    C
3    D
4    F
5    F
6    F
7    C
8    D
9    C
Name: Status, dtype: category
Categories (7, object): [fail < F < pass < D < C < B < A]
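
One way to get this order is reorder_categories, sketched below. The original call is not shown, so this is an assumption; set_categories would be another option.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Status": pd.Categorical(
        ["B", "C", "C", "D", "F", "F", "F", "C", "D", "C"],
        categories=["F", "D", "C", "B", "A", "fail", "pass"], ordered=True),
})

# The new list must contain exactly the existing categories, in the new order.
student_scores["Status"] = student_scores["Status"].cat.reorder_categories(
    ["fail", "F", "pass", "D", "C", "B", "A"], ordered=True)

print(student_scores["Status"])
```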

We can locate all rows of student_scores that have a Status greater than 'F' and set their Status to 'pass', then set the remaining rows to 'fail':

Python
Python

If we print this column out:

Python
0    pass
1    pass
2    pass
3    pass
4    fail
5    fail
6    fail
7    pass
8    pass
9    pass
Name: Status, dtype: category
Categories (7, object): [fail < F < pass < D < C < B < A]
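
A sketch of one way to reach this result with .loc and a boolean mask. The mask name passed is hypothetical, and computing it once before either assignment is a design choice made here so that the second assignment does not depend on the first.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Status": pd.Categorical(
        ["B", "C", "C", "D", "F", "F", "F", "C", "D", "C"],
        categories=["fail", "F", "pass", "D", "C", "B", "A"], ordered=True),
})

# True for every row graded above F in the category order.
passed = student_scores["Status"] > "F"

# Both new values are already categories, so the assignments are allowed.
student_scores.loc[passed, "Status"] = "pass"
student_scores.loc[~passed, "Status"] = "fail"

print(student_scores["Status"])
```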

We see that the previous categories are still listed even though they are no longer used. These can be removed using:

Python
0    pass
1    pass
2    pass
3    pass
4    fail
5    fail
6    fail
7    pass
8    pass
9    pass
Name: Status, dtype: category
Categories (2, object): [fail < pass]
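
This result matches what remove_unused_categories returns, sketched below on a Status column built directly with pd.Categorical as a shortcut.

```python
import pandas as pd

student_scores = pd.DataFrame({
    "Student": ["Philip", "Jess", "Maya", "Jordan", "John",
                "Roisin", "Petra", "Lisa", "Simon", "Peter"],
    "Status": pd.Categorical(
        ["pass"] * 4 + ["fail"] * 3 + ["pass"] * 3,
        categories=["fail", "F", "pass", "D", "C", "B", "A"], ordered=True),
})

# Drop every category that no row uses; the order of the rest is kept.
student_scores["Status"] = student_scores["Status"].cat.remove_unused_categories()

print(student_scores["Status"])
```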