seaborn

In the previous tutorials we have already looked at the matrix plotting library (matplotlib), which allowed us to create basic plots from data in a numeric Python (numpy) ndarray. We have also looked at using the Python data analysis library (pandas) to plot data from a dataframe. seaborn is a Python data visualisation library which bridges the pandas dataframe data structure and the matplotlib plotting library. It acts essentially as a wrapper around matplotlib that simplifies the code needed to plot data from dataframes and to create commonly used data visualisations.

matplotlib Recap

Before starting with seaborn, you should be comfortable with the Python programming language, numpy, pandas and matplotlib. For more details see my other Python guides:

Let's start with a simple script which uses numpy and matplotlib to make a plot of the sin and cos functions:

#%% import data science libraries
import numpy as np
import matplotlib.pyplot as plt
#%% generate data
xdata = np.linspace(-2*np.pi, 2*np.pi, 100)
ydata = np.sin(xdata)
ydata2 = np.cos(xdata)
#%% create plot using matplotlib
fig, ax = plt.subplots()
lines0 = ax.plot(xdata, ydata, label="y=sin(x)")
lines1 = ax.plot(xdata, ydata2, label="y=cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(bbox_to_anchor=(1.1, 0.5))
plt.tight_layout()

In the variable explorer we see we have fig which is a Figure object and ax which is an AxesSubplot object. A Figure object can contain one or more AxesSubplot objects. We also see we have two lists of Line2D objects. An AxesSubplot can contain multiple Line2D objects or other objects corresponding to visualisations of data.
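The relationship between these objects can also be checked from the console rather than the variable explorer. The following is a minimal sketch which rebuilds the plot above. Note that recent matplotlib releases may report the class of ax as Axes rather than AxesSubplot.

```python
import numpy as np
import matplotlib.pyplot as plt

xdata = np.linspace(-2*np.pi, 2*np.pi, 100)
fig, ax = plt.subplots()
lines0 = ax.plot(xdata, np.sin(xdata), label="y=sin(x)")
lines1 = ax.plot(xdata, np.cos(xdata), label="y=cos(x)")

# each call to plot returns a list holding the Line2D objects it created
print(type(lines0[0]).__name__)
# the axes keeps track of every line drawn on it
print(len(ax.lines))
# and the figure keeps track of the axes it contains
print(fig.axes[0] is ax)
```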

Styles

If we also import:

import pandas as pd
import seaborn as sns

We can use the seaborn function set_style and keyword input argument style to override the default style used in matplotlib to "whitegrid":

sns.set_style(style="whitegrid")

If we do this leaving the rest of the existing code the same:

#%% import data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%% set style
sns.set_style(style="whitegrid")
#%% generate data
xdata = np.linspace(-2*np.pi, 2*np.pi, 100)
ydata = np.sin(xdata)
ydata2 = np.cos(xdata)
#%% create plot using matplotlib
fig, ax = plt.subplots()
lines0 = ax.plot(xdata, ydata, label="y=sin(x)")
lines1 = ax.plot(xdata, ydata2, label="y=cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(bbox_to_anchor=(1.1, 0.5))
plt.tight_layout()

We see that our plot is styled using a white background with a grid, as opposed to just a white background:

There are also other subtle differences in the xtick labels and ytick labels. There are another four styles available: "white", "dark", "darkgrid" and "ticks". "white" has a white background and no grid:

sns.set_style(style="white")

"dark" has a dark background and no grid:

sns.set_style(style="dark")

"darkgrid" has a dark background and grid:

sns.set_style(style="darkgrid")
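To see what a style actually changes, we can look at the matplotlib parameters behind it. The sketch below uses the seaborn function axes_style, which returns the parameters for a style as a dictionary without applying them. Only two of the keys are printed here.

```python
import seaborn as sns

# axes_style returns the matplotlib rc parameters a style would set,
# without changing the current settings
for name in ["white", "whitegrid", "dark", "darkgrid"]:
    params = sns.axes_style(style=name)
    print(name, params["axes.facecolor"], params["axes.grid"])
```

axes_style can also be used in a with statement if a style should only apply to the plots created inside that block.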

Color Palettes

When we call plot multiple times on the same AxesSubplot without specifying a color, each Line2D generated has a different color. Let's create 9 sin plots using different amplitudes:

#%% import data science libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%% set style
sns.set_style(style="whitegrid")
#%% generate data
xdata = np.linspace(-2*np.pi, 2*np.pi, 100)
ydata = np.sin(xdata)
#%% create plot using matplotlib
fig, ax = plt.subplots()
ax.plot(xdata, ydata, linewidth=3, label="y=sin(x)")
ax.plot(xdata, 2*ydata, linewidth=3, label="y=2sin(x)")
ax.plot(xdata, 3*ydata, linewidth=3, label="y=3sin(x)")
ax.plot(xdata, 4*ydata, linewidth=3, label="y=4sin(x)")
ax.plot(xdata, 5*ydata, linewidth=3, label="y=5sin(x)")
ax.plot(xdata, 6*ydata, linewidth=3, label="y=6sin(x)")
ax.plot(xdata, 7*ydata, linewidth=3, label="y=7sin(x)")
ax.plot(xdata, 8*ydata, linewidth=3, label="y=8sin(x)")
ax.plot(xdata, 9*ydata, linewidth=3, label="y=9sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(bbox_to_anchor=(1.1, 0.8))
fig.tight_layout()

From the legend, we can see the order the lines were added alongside the color of these lines:

These colors are defined in the color palette. Because we have only set the style, the palette in use here is the matplotlib default, which is called "tab10". seaborn's own default palette "deep", which is applied when the seaborn theme is set, is a subtle variation of it. The seaborn function color_palette will output the colors as a list of normalised (r, g, b) tuples of floats:

sns.color_palette()

color_palette has a keyword input argument palette which can be assigned to the name of the palette as a string. If it is not specified, the palette currently in use is returned.

sns.color_palette(palette="tab10")

If the keyword input argument as_cmap is changed from the default False to True, a colormap is returned and displayed in the console, as opposed to a list of normalised (r, g, b) tuples:

sns.color_palette("tab10", as_cmap=True)

There are a number of other palettes such as "pastel":

sns.color_palette("pastel")
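The object returned by color_palette behaves like a list, so it can be indexed and measured. A minimal sketch follows, using the as_hex method of the returned palette to convert the tuples to hexadecimal strings:

```python
import seaborn as sns

tab10 = sns.color_palette(palette="tab10")
# number of colors in this qualitative palette
print(len(tab10))
# first color as an (r, g, b) tuple of floats between 0 and 1
print(tab10[0])
# the same color as a hexadecimal string
print(tab10.as_hex()[0])
```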

The seaborn function set_palette can be used with the keyword input argument palette to set the palette for all subsequent plots. Let's change the palette to "pastel":

sns.set_palette(palette="pastel")

#%% import data science libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#%% set style and palette
sns.set_style(style="whitegrid")
sns.set_palette(palette="pastel")
#%% generate data
xdata = np.linspace(-2*np.pi, 2*np.pi, 100)
ydata = np.sin(xdata)
#%% create plot using matplotlib
fig, ax = plt.subplots()
ax.plot(xdata, ydata, linewidth=3, label="y=sin(x)")
ax.plot(xdata, 2*ydata, linewidth=3, label="y=2sin(x)")
ax.plot(xdata, 3*ydata, linewidth=3, label="y=3sin(x)")
ax.plot(xdata, 4*ydata, linewidth=3, label="y=4sin(x)")
ax.plot(xdata, 5*ydata, linewidth=3, label="y=5sin(x)")
ax.plot(xdata, 6*ydata, linewidth=3, label="y=6sin(x)")
ax.plot(xdata, 7*ydata, linewidth=3, label="y=7sin(x)")
ax.plot(xdata, 8*ydata, linewidth=3, label="y=8sin(x)")
ax.plot(xdata, 9*ydata, linewidth=3, label="y=9sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(bbox_to_anchor=(1.1, 0.8))
fig.tight_layout()

Other common qualitative palettes are "Set1" (9 colors), "Set2" (8 colors) and "Set3" (12 colors). If the number of colors in the palette is exceeded, it loops back round to the beginning:

sns.color_palette(palette="Set1")
sns.color_palette(palette="Set2")
sns.color_palette(palette="Set3")
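The palette sizes and the looping behaviour can be checked directly. This is a small sketch which uses the n_colors keyword input argument of color_palette to request more colors than "Set2" contains:

```python
import seaborn as sns

# number of colors held by each qualitative palette
for name in ["Set1", "Set2", "Set3"]:
    print(name, len(sns.color_palette(palette=name)))

# asking for more colors than "Set2" holds cycles back to the start
set2_long = sns.color_palette(palette="Set2", n_colors=10)
print(set2_long[8] == set2_long[0])
```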

There are also the circular palettes "hls", which samples evenly spaced hues in the hue, lightness, saturation color space, and "husl", which is the human friendly variant of that color space:

sns.color_palette(palette="hls", as_cmap=True)
sns.color_palette(palette="husl", as_cmap=True)

Sometimes the data are related to one another, i.e. they follow a series, like the dataset shown where the amplitude increases for each line. In that case it is appropriate to use a sequential color palette:

sns.color_palette(palette="magma", as_cmap=True)

By default this gives only 6 different colors, so with more than 6 Line2D objects the colors repeat, looping back around and losing the effect of using the sequential color palette:
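One way around the repetition is to ask for as many colors as there are lines. The sketch below assumes nine lines, as in the sin example, and samples nine colors from "magma" using the n_colors keyword input argument:

```python
import seaborn as sns

# the default sample from a sequential palette holds six colors
print(len(sns.color_palette(palette="magma")))

# sample nine evenly spaced colors instead, one per line
magma9 = sns.color_palette(palette="magma", n_colors=9)
print(len(magma9))

# apply them as the color cycle for subsequent plots
sns.set_palette(palette=magma9)
```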

There are a few additional palette functions such as dark_palette, which blends a specified color with black, and light_palette, which blends a specified color with white. These have the keyword input arguments n_colors and as_cmap as seen before, as well as reverse.

sns.light_palette(color="#00b050", n_colors=10, as_cmap=True)
sns.dark_palette(color="#00b050", n_colors=10, as_cmap=True)
sns.light_palette(color="#00b050", n_colors=10, as_cmap=True, reverse=True)
sns.dark_palette(color="#00b050", n_colors=10, as_cmap=True, reverse=True)

To apply the palette, save it as a variable and assign the keyword input argument in set_palette to the variable:

green_light = sns.light_palette(color="#00b050", n_colors=10)
sns.set_palette(palette=green_light)

We can use blend_palette to blend a list of colors. For example, if we wanted to make a palette from the standard colors in Microsoft Office:

sns.blend_palette(colors=["#c00000", "#ff0000", "#ffc000", "#ffff00", "#92d050", "#00b050", "#00b0f0", "#0070c0", "#002060", "#7030a0"], n_colors=10, as_cmap=True)
standard_colors = sns.blend_palette(colors=["#c00000", "#ff0000", "#ffc000", "#ffff00", "#92d050", "#00b050", "#00b0f0", "#0070c0", "#002060", "#7030a0"], n_colors=10)
sns.set_palette(palette=standard_colors)

The above required the use of hexadecimal color codes. These can be found in Microsoft Office by selecting a color, then selecting More Colors… and then selecting the Custom tab. The red, green and blue values are shown there. These are 8 bit values from 0 to 255 that need to be normalised by dividing by 255 for matplotlib to recognise them as a tuple of (r, g, b) floats. The hexadecimal value is also displayed at the bottom:
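The conversion between the two representations can be sketched with the matplotlib.colors functions to_hex and to_rgb. The values below are the red, green and blue values for the color "#00b050" used earlier:

```python
from matplotlib.colors import to_hex, to_rgb

# 8 bit red, green and blue values as read from the Custom tab
rgb_255 = (0, 176, 80)
# normalised floats between 0 and 1
rgb = tuple(value/255 for value in rgb_255)
print(rgb)
# hexadecimal string for the same color
print(to_hex(rgb))
# and back to normalised floats
print(to_rgb("#00b050"))
```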


The xkcd color survey assigns names to 954 commonly used colors, analogous to the names on tins of paint found in hardware stores. These can be seen on the xkcd website:

seaborn has a dictionary xkcd_rgb which uses the color name as a key and returns the hex value. For example:

sns.xkcd_rgb["sky blue"]

seaborn also has the function xkcd_palette which can be used to create a palette from a list of these color names. This function does not have the keyword input arguments n_colors (as this is specified from the list provided) or as_cmap unfortunately:

xkcd_colors = sns.xkcd_palette(colors=["purple", "green", "blue", "pink", "brown", "red", "light blue", "teal", "orange", "light green"])
sns.set_palette(palette=xkcd_colors)

The seaborn function mpl_palette can be used to return a list of (r, g, b) tuples sampled from a matplotlib colormap, or the colormap itself when as_cmap is True:

sns.mpl_palette("tab20", n_colors=20, as_cmap=True)

More details about matplotlib colormaps are available:

There are a number of functions such as choose_diverging_palette, choose_light_palette, choose_dark_palette, choose_colorbrewer_palette and choose_cubehelix_palette. These functions are designed to work with interactive Python notebooks typically used in the JupyterLab IDE and aren't applicable to Spyder.

The palettes module can also be accessed:

sns.palettes

pandas Recap and Example Datasets

You should be comfortable with using pandas to create a dataframe, or to read in a csv or xlsx file and create a dataframe from it. You should also be familiar with the concept of categorical data. I will create a basic dataframe from scratch, assigned to the variable name dataframe. It will have the series xdata and ydata, which are linearly spaced, although some noise will be added to ydata. It will also have a categorical series even_odd which will be the category "even" or "odd" depending on the value in xdata:

#%% import data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy.random as random
random.seed(0)
#%% set style and palette
sns.set_style(style="whitegrid")
sns.set_palette(palette="tab10")
#%% create data
dataframe = pd.DataFrame({"xdata": np.arange(start=0, stop=11, step=1),
                          "ydata": np.arange(start=0, stop=11, step=1) + random.randn(11)})
dataframe["even_odd"] = dataframe.xdata % 2 == 0
dataframe["even_odd"] = dataframe.even_odd.astype("category")
dataframe["even_odd"] = dataframe.even_odd.cat.rename_categories({True: "even", False: "odd"})

The dataframe displays in the variable explorer and looks like the following:

We can check the datatypes of each series in the dataframe using the dataframe attribute dtypes:

dataframe.dtypes
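The categorical series can also be built in a single step. The following is an alternative sketch, not the approach used above: it uses np.where to create the labels and pd.Categorical to fix the order of the categories. The variable name example is only used for illustration.

```python
import numpy as np
import pandas as pd

xdata = np.arange(start=0, stop=11, step=1)
# np.where picks "even" where the condition holds and "odd" elsewhere
labels = np.where(xdata % 2 == 0, "even", "odd")
example = pd.DataFrame({"xdata": xdata,
                        "even_odd": pd.Categorical(labels, categories=["even", "odd"])})
print(example.dtypes)
# number of rows that fall into each category
print(example.even_odd.value_counts())
```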

For the sin data we used before, we can also construct a dataframe in which what were previously separate numpy arrays of ydata are stacked in a single series y, with a categorical series amplitude identifying each one. For convenience we will use a for loop to concatenate these to the dataframe:

data = pd.DataFrame({"x": np.linspace(-2*np.pi, 2*np.pi, 100),
                     "y": np.sin(np.linspace(-2*np.pi, 2*np.pi, 100)),
                     "amplitude": np.ones(100)})
for idx in range(2, 11, 1):
    data = pd.concat([data, pd.DataFrame({"x": np.linspace(-2*np.pi, 2*np.pi, 100),
                                          "y": idx*np.sin(np.linspace(-2*np.pi, 2*np.pi, 100)),
                                          "amplitude": idx*np.ones(100)})])

data["amplitude"] = data.amplitude.astype("category")
data = data.reset_index(drop=True)

We can see in the variable explorer that this dataframe data has 1000 rows:

If we sort by x, we see that each x value has ten rows, one for each amplitude from 1 up to and including 10.

And have a look at the datatype of each series using the attribute dtypes:

data.dtypes
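The same long form dataframe can be assembled without concatenating inside the loop. This is a sketch of one alternative, using a list comprehension and a single call to concat. The variable name long_data is hypothetical and the amplitudes are stored as integers rather than floats.

```python
import numpy as np
import pandas as pd

x = np.linspace(-2*np.pi, 2*np.pi, 100)
# one small dataframe per amplitude, concatenated in a single call
frames = [pd.DataFrame({"x": x, "y": amp*np.sin(x), "amplitude": amp})
          for amp in range(1, 11)]
long_data = pd.concat(frames, ignore_index=True)
long_data["amplitude"] = long_data.amplitude.astype("category")
print(long_data.shape)
print(long_data.amplitude.cat.categories)
```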

Example Datasets

For convenience a number of practice datasets in the form of dataframes are inbuilt into seaborn. The names of these datasets can be accessed by using the function get_dataset_names:

sns.get_dataset_names()

Flights Dataset

These are loaded using the seaborn method load_dataset and the keyword input argument name is assigned to the name of the dataset:

flights = sns.load_dataset(name="flights")

Typically the name of the dataset is used as the variable name for the dataframe. We can view flights in the variable explorer:

We have two numeric series, year corresponding to the year and passengers corresponding to the number of passengers. There is also the categorical series month, with 12 months for each year:

And have a look at the datatype of each series using the attribute dtypes:

flights.dtypes

Exercise Dataset

Let's load in the exercise dataset, which compares the pulse of participants on two different diets:

exercise = sns.load_dataset(name="exercise")

The exercise dataset has the numeric series id, where each participant is assigned a unique id, and pulse. There are three categorical series: kind, which reflects an exercise level, time, which reflects the time within the exercise level, and finally diet, which reflects the diet of the participants:

exercise.dtypes

Tips Dataset

Let's load the dataset tips, which is a dataset involving customers in a restaurant:

tips = sns.load_dataset(name="tips")

This has three numeric series corresponding to the total bill, tip and party size. There are also the categories sex, smoker, day and time of day:

tips.dtypes

Taxis Dataset

Let's load in the taxis dataset which is a dataset that compares two brands of taxi yellow and green:

taxis = sns.load_dataset(name="taxis")

This dataset has two datetime series, the pickup and dropoff times. It has six numeric series: passengers, distance, fare, tip, tolls and total. It has six series holding categorical information: the color, i.e. brand, of the taxi, the payment method, pickup_zone, dropoff_zone, pickup_borough and dropoff_borough:

taxis.dtypes

Iris Dataset

iris = sns.load_dataset(name="iris")

This dataset has four numeric series which measure the physical features of the iris: the sepal length, sepal width, petal length and petal width. It has a categorical series, species, corresponding to one of the three species of iris plants.

iris.dtypes

Penguins Dataset

Let's load in the penguins dataset which is a dataset that compares physical features of three species of penguins.

penguins = sns.load_dataset(name="penguins")

This dataset has four numeric series which measure the physical features of the penguins: the bill length, bill depth, flipper length and body mass. It has three categorical series: species, corresponding to one of the three species of penguins, island, the island the penguins live on, and sex, male or female.

penguins.dtypes

seaborn Modules

If we type in:

sns.

We see a large list of the most commonly used functions:

If we type in:

sns.palettes.

We get only the functions related to palettes, the most commonly used were demonstrated above however some additional functions are available:

seaborn has a number of other modules, each of which groups together the common plots for the kind of data described by the module's name:

  • relational
  • regression
  • distributions
  • categorical
  • matrix

Relational Plots

The simplest module is the relational module and the two plotting functions commonly used under this module are the lineplot and scatterplot:

Note that these can be called from sns directly or via the sns module relational, i.e. the following are the same:

sns.lineplot()
sns.relational.lineplot()

These two plot types are wrappers around the matplotlib pyplot functions plot and scatter respectively, but optimised to receive data from dataframes.

Notice the following keyword input arguments: data, which is assigned to the dataframe, and, as a relational plot requires x and y data, x and y, which should be assigned to the strings of the x and y column names respectively.

Let's have a look at the simple dataframe called dataframe that we created before:

fig, ax = plt.subplots()
lineplot = sns.lineplot(data=dataframe, x="xdata", y="ydata")

Notice that the xlabel and ylabel are automatically generated from the dataframe series names.

We can use identical syntax for a scatterplot:

fig, ax = plt.subplots()
scatterplot = sns.scatterplot(data=dataframe, x="xdata", y="ydata")

Hue

The keyword input argument hue can be assigned to a categorical series, in this case even_odd. Doing so will split the data by the series and give each category a different color, taken in order from the palette:

fig, ax = plt.subplots()
scatterplot = sns.scatterplot(data=dataframe, x="xdata", y="ydata", hue="even_odd")

Notice that the legend automatically populates using the legend title as the name of the selected categorical series and displays the categories:

The keyword input argument palette can be used to assign a specific palette. This will override the palette for this specific plot only, as opposed to the one set using the seaborn function set_palette:

fig, ax = plt.subplots()
scatterplot = sns.scatterplot(data=dataframe, x="xdata", y="ydata", hue="even_odd", palette="Set3")
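As well as a palette name, palette can be given a dictionary which maps each category to a color. The sketch below assumes this is the behaviour wanted when specific colors matter. It uses a small hypothetical dataframe demo and reads the legend entries back from the AxesSubplot:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

demo = pd.DataFrame({"xdata": np.arange(11), "ydata": np.arange(11)})
demo["even_odd"] = pd.Categorical(np.where(demo.xdata % 2 == 0, "even", "odd"))

fig, ax = plt.subplots()
# each category is mapped to an explicitly chosen color
sns.scatterplot(data=demo, x="xdata", y="ydata", hue="even_odd",
                palette={"even": "#00b050", "odd": "#c00000"})
# the legend entries are generated from the categories
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```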

We can create a lineplot of our sin data stored in the dataframe called data using:

fig, ax = plt.subplots()
lineplot = sns.lineplot(data=data, x="x", y="y", hue="amplitude")

We can move the location of the legend using the standard matplotlib commands:

fig, ax = plt.subplots()
lineplot = sns.lineplot(data=data, x="x", y="y", hue="amplitude")
ax.legend(bbox_to_anchor=(1.1, 0.8))
fig.tight_layout()

AxesSubplot

Note that the seaborn plotting functions must be called using the function from the seaborn library or relevant seaborn module respectively. No additional methods for these plots are given to an AxesSubplot.

Let's create a lineplot of the data in the flights dataframe:

#%% create data
flights = sns.load_dataset("flights")
#%% plot data
fig, ax = plt.subplots()
lineplot = sns.lineplot(data=flights, x="year", y="passengers", hue="month")
ax.legend(bbox_to_anchor=(1.1, 0.8))
fig.tight_layout()

If we have a look in the variable explorer, we see that lineplot is the type AxesSubplot:

And in actual fact:

ax == lineplot

The seaborn function lineplot therefore modifies the existing AxesSubplot directly. If we have a look at the AxesSubplot attribute lines, we see it is occupied by the Line2D objects created by the seaborn lineplot function:

ax.lines

Because we cannot call a seaborn plotting function as a method of an AxesSubplot, one way to choose where it draws is to use the matplotlib pyplot function sca to set the current AxesSubplot. Any seaborn plotting function which modifies an AxesSubplot will modify the selected AxesSubplot. Let's demonstrate this using matplotlib subplots, setting the keyword input arguments nrows and ncols to 2 and 1 respectively, giving 2 rows by 1 column of subplots. Recall that we access these subplots using their numeric index, in this case ax[0] and ax[1].

fig, ax = plt.subplots(nrows=2, ncols=1)
plt.sca(ax[0])
sns.lineplot(data=flights, x="year", y="passengers", hue="month", legend=False)
plt.sca(ax[1])
sns.scatterplot(data=flights, x="year", y="passengers", hue="month", legend=False)
fig.tight_layout()

This gives the data plotted on each subplot as expected.

Note the variable names lineplot and scatterplot were not assigned, as the plots can now be accessed using ax[0] and ax[1] respectively.

Regression Plots

If we go to the regression module and have a look at the plotting functions available, we see there is regplot, which is essentially a scatter plot with a fitted trendline, either a linear model or a curve from a higher order polynomial. The residplot gives a plot of the associated residuals.

Let's fit this to our dataframe called dataframe which has linearly spaced x data and linearly spaced ydata with noise added:

fig, ax = plt.subplots(nrows=2, ncols=1)
plt.sca(ax[0])
sns.regplot(data=dataframe, x="xdata", y="ydata")
plt.sca(ax[1])
sns.residplot(data=dataframe, x="xdata", y="ydata")

The residual (bottom subplot) is essentially the difference between the datapoint and the blue line (top subplot):
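The residuals can be sketched by hand with numpy. This example assumes a straight line fit and uses its own noisy data generated with numpy's default_rng, so the values will not match the figure above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(11)
y = x + rng.standard_normal(11)

# polyfit returns the coefficients from the highest degree down
slope, intercept = np.polyfit(x, y, deg=1)
# residual = measured value minus the value predicted by the fitted line
residuals = y - (slope*x + intercept)
print(residuals)
# for a least squares line with an intercept the residuals sum to roughly zero
print(residuals.sum())
```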

The keyword input argument order can be changed from the default value of 1 (linear equation) to a higher order polynomial such as 2 for a quadratic equation:

fig, ax = plt.subplots(nrows=2, ncols=1)
plt.sca(ax[0])
sns.regplot(data=dataframe, x="xdata", y="ydata", order=2)
plt.sca(ax[1])
sns.residplot(data=dataframe, x="xdata", y="ydata", order=2)
fig.tight_layout()

seaborn does not include an option to display the equation of the fit on the regplot. The seaborn developers are against incorporating this despite it being widely requested, as they see users of other plotting packages place poorly understood equations on graphs. For polynomial fits regplot relies on numpy, so we can use the numpy polynomial functions to get the equation ourselves:

from numpy.polynomial import Polynomial

f1 = Polynomial.fit(dataframe.xdata, dataframe.ydata, deg=1)
str(f1.convert())

f2 = Polynomial.fit(dataframe.xdata, dataframe.ydata, deg=2)
str(f2.convert())
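If the equation is wanted on the plot, it can be added as text with matplotlib. The following is a sketch of one way to do this, using a noise-free line y = 2x + 1 so the expected coefficients are known. The convert method maps the fit back to the unscaled x values, and the label format is just one possible choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from numpy.polynomial import Polynomial

x = np.arange(11)
y = 2*x + 1
fit = Polynomial.fit(x, y, deg=1).convert()
# coefficients are ordered from the lowest degree up
intercept, slope = fit.coef
label = f"y = {slope:.2f}x + {intercept:.2f}"

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x, fit(x))
# place the equation in the top left, in axes coordinates
ax.text(0.05, 0.9, label, transform=ax.transAxes)
print(label)
```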

As expected, the data fits to a linear model fairly well (as it was created from linear data). The coefficient of the x squared term is low suggesting that this term is not required to fit the data.

FacetGrid

A plot type that seems similar to regplot is the lmplot (linear model plot). It fits the same kinds of model as regplot, including the order keyword input argument, but it creates its own Figure.

Let's have a look at using this on the same data as before:

fig, ax = plt.subplots(nrows=1, ncols=1)
lmplot = sns.lmplot(data=dataframe, x="xdata", y="ydata")

Notice what happens: we get a Figure with an AxesSubplot that has no data (created by the top line) and another Figure with the lmplot on it:

If we have a look on the variable explorer, we see that lmplot is not an AxesSubplot but a new object called a FacetGrid:

A FacetGrid can essentially be conceptualised as an object on the Figure level that can hold one or multiple AxesSubplots. If we have a look at lmplot again we can see there are additional keyword input arguments col, row and col_wrap which can be used to create multiple AxesSubplots:


If we use the keyword input argument col or row instead of hue and set the value to a categorical series, we can create separate subplots:

lmplot = sns.lmplot(data=dataframe, x="xdata", y="ydata", col="even_odd")
lmplot = sns.lmplot(data=dataframe, x="xdata", y="ydata", row="even_odd")

This can be seen in better detail using the flights dataset:

flights = sns.load_dataset(name="flights")
lmplot = sns.lmplot(data=flights, x="year", y="passengers", col="month")

If only one categorical series is listed as a col and row is unused, the keyword input argument col_wrap can be assigned to an integer specifying how many AxesSubplots to display in a row before wrapping around to a new row:

flights = sns.load_dataset(name="flights")
lmplot = sns.lmplot(data=flights, x="year", y="passengers", col="month", col_wrap=4)

We can use the larger taxis dataset to have a look at a lmplot of fare with respect to distance using hue to correspond to the brand color of the taxis. Because we know the color brands are yellow and green, we can create our own custom palette yellowgreenpalette for this case and apply it. We can assign col to the number of passengers and set a column wrap of 3:

taxis = sns.load_dataset(name="taxis")
yellowgreenpalette = sns.blend_palette(colors=["#ffff00", "#00b050"], n_colors=2)
lmplot = sns.lmplot(data=taxis, x="distance", y="fare", hue="color", palette=yellowgreenpalette, col="passengers", col_wrap=3)

The FacetGrid has a number of methods or attributes. We can access these by typing in:

lmplot.

Notably, we can access the Figure and AxesSubplots using the attributes figure and axes respectively:

fig = lmplot.figure
ax = lmplot.axes
ax
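The axes attribute is a numpy array of AxesSubplots, so it can be indexed or looped over. A minimal sketch follows, using a small hypothetical dataframe demo split into two columns:

```python
import numpy as np
import pandas as pd
import seaborn as sns

demo = pd.DataFrame({"xdata": np.arange(11)})
demo["ydata"] = demo.xdata + np.sin(demo.xdata)
demo["even_odd"] = np.where(demo.xdata % 2 == 0, "even", "odd")

grid = sns.lmplot(data=demo, x="xdata", y="ydata", col="even_odd")
print(type(grid).__name__)
# one row and two columns of subplots
print(grid.axes.shape)
# each subplot is titled with the category it shows
titles = [axes.get_title() for axes in grid.axes.flat]
print(titles)
```

Each AxesSubplot in the array can then be customised with the usual matplotlib methods.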

For most other plot types there are plotting functions that output an AxesSubplot and similar plotting functions that output a FacetGrid. Modern seaborn documentation has tried to make it clear what each plotting function will output, as this previously led to a lot of confusion for beginners.

Distribution Plots

Let's have a look at the distributions module using:

sns.distributions.

There are five plotting functions. Four of them, rugplot, histplot, kdeplot and ecdfplot, output AxesSubplots, and displot outputs a FacetGrid.

Let's have a look at the random normal distribution as an example distribution that we are already familiar with. To get started we will use a very small number of points 10:

#%% import data science libraries
import numpy as np
import numpy.random as random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
random.seed(0)
#%% create random data
data = pd.DataFrame({"x": random.randn(10)})

If we have a look at the variable explorer we can see that we have a single series "x" that has 10 random normally distributed values:

We can create a rugplot using:

fig, ax = plt.subplots()
sns.rugplot(data=data, x="x", height=0.5, color="#ff0000")

And we can see all a rugplot does is create a line for each datapoint. In a rugplot the height is arbitrary and each datapoint has the same height:

We can also create a histplot, which groups the data along the x axis into bins of constant width and plots the number of datapoints which fall within each bin on the y axis. We can see precisely what's going on by looking at the bin width and comparing the histplot to the above rugplot:

fig, ax = plt.subplots()
sns.histplot(data=data, x="x", bins="auto", color="#ff0000")
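The counting behind a histogram can be sketched with numpy. This example assumes that bins="auto" follows the numpy rule of the same name, and it generates its own ten points with numpy's default_rng, so the bins will differ from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.standard_normal(10)

# edges are the bin boundaries, counts the number of datapoints in each bin
counts, edges = np.histogram(sample, bins="auto")
print(edges)
print(counts)
# every bin has the same width
print(np.diff(edges))
```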

We also have the kdeplot which calculates and plots a kernel density estimation function:

fig, ax = plt.subplots()
sns.kdeplot(data=data, x="x", color="#ff0000")

To understand a kdeplot, let's have a look at how one is constructed from a rugplot:

#%% create figure with axes
fig, ax = plt.subplots(nrows=3, ncols=1)

#%% create a rugplot
plt.sca(ax[0])
ax0 = sns.rugplot(data=data, x="x", height=0.5, color="#ff0000")
ax0.set_xlim([-4, 4])
plt.setp(ax0.collections[0], colors=sns.color_palette())

#%% draw a gaussian at each datapoint
plt.sca(ax[1])

for idx in range(len(data.x)):
    mu = data.x[idx]
    variance = 1/len(data.x)
    sigma = variance**0.5
    x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
    ax[1].plot(x, stats.norm.pdf(x, mu, sigma)/len(data.x))
    
ax[1].set_xlim([-4, 4])

#%% create a kde plot
plt.sca(ax[2])
ax2 = sns.kdeplot(data=data, x="x", color="#ff0000")
ax2.set_xlim([-4, 4])

Essentially each datapoint in the rugplot is broadened into a Gaussian distribution and all these individual Gaussian distributions are added up to create a kdeplot:
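The summation can be checked numerically. The sketch below is an assumption-laden illustration rather than seaborn's own code: it uses scipy's gaussian_kde, which chooses the width of each Gaussian with Scott's rule, and compares it with a manual sum of Gaussians of that same width.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
sample = rng.standard_normal(10)
grid = np.linspace(-4, 4, 200)

# scipy's estimator, using Scott's rule for the bandwidth by default
kde = stats.gaussian_kde(sample)
# width of each individual Gaussian
sigma = kde.factor*sample.std(ddof=1)

# one Gaussian per datapoint, each weighted by 1/n, then summed
manual = sum(stats.norm.pdf(grid, loc=mu, scale=sigma) for mu in sample)/len(sample)

# the manual sum and the scipy estimate agree
print(np.allclose(manual, kde(grid)))
```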

If we now increase the number of datapoints to 1000 (by regenerating data using random.randn(1000)) we can compare these three distribution plots. For the rugplot we will set a low alpha value of 0.01, making it easier to see how much the datapoints overlap, and for the histplot we will set the bins to 50:

#%% create figure with axes
fig, ax = plt.subplots(nrows=3, ncols=1)

#%% create a rugplot
plt.sca(ax[0])
sns.rugplot(data=data, x="x", height=0.5, color="#ff0000", alpha=0.01)
ax[0].set_xlim([-5, 5])
#%% create a histplot
plt.sca(ax[1])
sns.histplot(data=data, x="x", bins=50, color="#ff0000")
ax[1].set_xlim([-5, 5])
#%% create a kdeplot
plt.sca(ax[2])
sns.kdeplot(data=data, x="x", color="#ff0000")
ax[2].set_xlim([-5, 5])

And as expected, the histplot and kdeplot begin to resemble one another:

If the number of datapoints is increased to 100000 the histplot and kdeplot almost perfectly overlap:

The empirical cumulative distribution function plot, ecdfplot, plots the proportion of datapoints that are less than or equal to the current x axis value. This can be seen for a low number of datapoints by comparing a rugplot to an ecdfplot:

#%% create figure with axes
fig, ax = plt.subplots(nrows=2, ncols=1)
#%% create a rugplot
plt.sca(ax[0])
sns.rugplot(data=data, x="x", height=0.5, color="#ff0000", alpha=1)
ax[0].set_xlim([-3, 3])
#%% create a ecdfplot
plt.sca(ax[1])
sns.ecdfplot(data=data, x="x")
ax[1].set_xlim([-3, 3])

You can see each step on the ecdfplot corresponds to a datapoint on the rugplot:
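The steps can be sketched by hand with numpy: sort the datapoints and let the proportion rise by 1/n at each one. This example generates its own ten points with numpy's default_rng.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.standard_normal(10)

# the x position of each step is a datapoint, taken in ascending order
x_sorted = np.sort(sample)
# the proportion rises by 1/n at every datapoint and finishes at 1
proportion = np.arange(1, len(x_sorted) + 1)/len(x_sorted)

for x_value, p in zip(x_sorted, proportion):
    print(f"{x_value: .3f}  {p:.1f}")
```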

Increasing the datapoints to 1000 and 100000 gives an ecdfplot like the following. It is clear to see that the proportion at the central value of the normal distribution, x = 0, is 0.5 as expected, and that the curve below the central value and the curve above it are essentially mirrored:

Let's now compare the random normal (randn) and random uniform (rand) distributions:

#%% import data science libraries
import numpy as np
import numpy.random as random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
random.seed(0)
#%% create random data
data = pd.DataFrame({"x": random.randn(100000)})
data2 = pd.DataFrame({"x": random.rand(100000)})
#%% create figure with axes
fig, ax = plt.subplots(nrows=4, ncols=1)
#%% create rugplots
plt.sca(ax[0])
sns.rugplot(data=data, x="x", height=0.5, alpha=0.01, label="randn")
sns.rugplot(data=data2, x="x", height=0.5, alpha=0.01, label="rand")
ax[0].set_xlim([-5, 5])
#%% create a histplot
plt.sca(ax[1])
sns.histplot(data=data, x="x", bins=50, label="randn")
sns.histplot(data=data2, x="x", bins=50, label="rand")
ax[1].set_xlim([-5, 5])
#%% create a kdeplot
plt.sca(ax[2])
sns.kdeplot(data=data, x="x")
sns.kdeplot(data=data2, x="x")
ax[2].set_xlim([-5, 5])
#%% create a ecdfplot
plt.sca(ax[3])
sns.ecdfplot(data=data, x="x")
sns.ecdfplot(data=data2, x="x")
ax[3].set_xlim([-5, 5])

The displot outputs a FacetGrid and is on the Figure level. It has a keyword input argument kind which can be assigned to the three plot types seen above, "hist", "kde" and "ecdf" respectively, and has a default value of "hist". The rugplot can be shown at the bottom of the plot by overriding the keyword input argument rug from the default False to True.

The keyword input arguments hue, row, col and col_wrap behave in the same manner as observed for the lmplot.

The default kind is "hist". The optional keyword input argument bins defaults to "auto" but can be assigned to an integer value, as observed above when using histplot:

displot = sns.displot(data=data, x="x")

We can change the kind to "kde":

displot = sns.displot(data=data, x="x", kind="kde")

Or we can change it to "ecdf":

displot = sns.displot(data=data, x="x", kind="ecdf")

Let's have a look at the penguins dataset. Recall that this dataset has four numeric series which measure the physical features of the penguins such as the bill length, bill depth, flipper length and body mass and it has three categorical series, species corresponding to one of the three species of the penguins, island the penguins live in and sex male or female.

Let's create two displots. displot1 will display a histogram with x assigned to the numeric series "bill_length_mm", hue assigned to the categorical series "species" and col assigned to "sex". displot2 will instead display a histogram with x assigned to the numeric series "bill_depth_mm", with all other keyword input arguments assigned to the same values as in displot1:

penguins = sns.load_dataset(name="penguins")
displot1 = sns.displot(data=penguins, x="bill_length_mm", hue="species", col="sex")
displot2 = sns.displot(data=penguins, x="bill_depth_mm", hue="species", col="sex")

The displot has the keyword input argument y in addition to x. Assigning both gives a 2D histogram, where the x and y axes are each divided into bins and every pair of bins is displayed as a rectangle. The number of datapoints in each rectangle is indicated by its shading (darker shading = higher count). In 2 dimensions the three species are far easier to separate than in 1 dimension:

displot3 = sns.displot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", col="sex")
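The counting behind a 2D histogram can be sketched with numpy alone. The values below are synthetic and only loosely resemble bill measurements, and the choice of 10 bins per axis is an assumption made for illustration:

```python
import numpy as np

# Hypothetical measurements, not taken from the penguins dataset.
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=45, scale=5, size=300)   # stand-in for bill length
y = rng.normal(loc=17, scale=2, size=300)   # stand-in for bill depth

# Divide each axis into 10 bins and count the datapoints in every rectangle.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape)       # (10, 10)
print(int(counts.sum()))  # 300
```

Each entry of counts corresponds to one rectangle, and a larger count would be drawn with darker shading.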

We can also view this as a kde by assigning the keyword input argument kind to "kde":

displot4 = sns.displot(data=penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", hue="species", col="sex")

AxisGrid Plots

A collection of different but related plots is an axisgrid and if we type in:

sns.axisgrid.

We can view the plot types available such as the jointplot and the pairplot.

Let's have a look at the jointplot. This essentially gives a 2D plot of x against y, alongside 1D distribution plots of x and y as separate marginal subplots. It outputs a JointGrid which, like the FacetGrid, operates on the Figure level. This seaborn plotting function has the hue keyword input argument but does not have the col, row or col_wrap keyword input arguments because the JointGrid already has prescribed subplots:

jointplot = sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")

It has the keyword input argument kind which has a default of "scatter" but can be changed to "hist" or "kde":

jointplot2 = sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="hist")
jointplot3 = sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="kde")

We have seen a 1 dimensional displot of a single numerical series x and a 2 dimensional displot of the numerical series x and y. The penguins dataset has four numerical series, so we could create a separate displot for every x and y pair. A pairplot does this automatically. Note that pairplot has no keyword input argument x or y, as by default it plots every numeric series against every other:

pairplot1 = sns.pairplot(data=penguins, hue="species")

The pairplot has a keyword input argument kind which specifies the kind of 2D plot in the off-diagonals. kind has the default value of "scatter" but can be changed to "hist" or "kde". The pairplot also has a keyword input argument diag_kind which specifies the kind of 1D plot on the diagonal. The default is "auto", which selects "hist" or "kde" depending on kind and on whether hue is used, and it can be set explicitly to "hist" or "kde". An "ecdf" is not offered for diag_kind and would in any case be hard to relate to the 2D plots:

pairplot2 = sns.pairplot(data=penguins, hue="species", kind="hist", diag_kind="hist")
pairplot3 = sns.pairplot(data=penguins, hue="species", kind="kde", diag_kind="kde")

As the 2D subplots are essentially mirrored across the diagonal, we can also use the keyword input argument corner and assign it from the default value False to True:

pairplot4 = sns.pairplot(data=penguins, hue="species", kind="kde", diag_kind="kde", corner=True)

Categorical Plots

A number of categorical plots are available to compare distributions of categorical data with one another. If we type in:

sns.categorical.

We see a list of these categorical plots such as the boxplot, violinplot, stripplot and swarmplot which all output an AxesSubplot.

Let's have a look at creating a boxplot:

The boxplot has the keyword input arguments data, x and y. x is used if a horizontal boxplot is wanted and y is used if a vertical boxplot is desired. Let's have a look at this using the exercise data:

exercise = sns.load_dataset("exercise")
fig, ax = plt.subplots()
sns.boxplot(data=exercise, x="pulse")
fig, ax = plt.subplots()
sns.boxplot(data=exercise, y="pulse")

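In either case we see the box, which spans the lower quartile to the upper quartile and so contains the middle 50 % of values, with a line representing the median. By default the whiskers extend to the lowest and highest datapoints within 1.5 times the interquartile range of the box, and any outliers beyond them are shown as black diamonds.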
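The quantities drawn in a box plot can be reproduced with numpy. This is a minimal sketch using a handful of hypothetical pulse values (not taken from the exercise dataset) and assuming the default whisker length of 1.5 times the interquartile range:

```python
import numpy as np

# Hypothetical pulse values, with one unusually high reading.
pulse = np.array([85, 88, 90, 92, 93, 95, 97, 99, 100, 103, 150])

# The box spans the lower quartile to the upper quartile.
q1, median, q3 = np.percentile(pulse, [25, 50, 75])
iqr = q3 - q1

# Datapoints outside these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = pulse[(pulse >= lower_fence) & (pulse <= upper_fence)]
outliers = pulse[(pulse < lower_fence) | (pulse > upper_fence)]

print(q1, median, q3)              # 91.0 95.0 99.5
print(inside.min(), inside.max())  # 85 103
print(outliers)                    # [150]
```

The whiskers end at the furthest datapoints inside the fences (85 and 103 here), not at the fences themselves.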

It is common to assign x to a categorical series and y to a numerical series. In this case we can assign x to "kind" and y to "pulse" which will give a boxplot of the pulse with respect to each exercise type:

fig, ax = plt.subplots()
sns.boxplot(data=exercise, x="kind", y="pulse")

We can assign the keyword input arguments hue to "diet", hue_order to ["low fat", "no fat"] and palette to [sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]] to compare the two different diet types:

fig, ax = plt.subplots()
sns.boxplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])

A violinplot can be used in place of a boxplot. The two sides of each violin show mirrored kdes:

fig, ax = plt.subplots()
sns.violinplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])
ax.legend(loc="upper left")

When comparing two values for hue, the keyword input argument split can be assigned from the default value False to True:

fig, ax = plt.subplots()
sns.violinplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]], split=True)
ax.legend(loc="upper left")

We can also create a strip plot or swarm plot of the data. Both of these are categorical scatter plots. The strip plot applies a small random jitter along the categorical axis, so datapoints can still overlap, whereas the swarm plot adjusts the position of each datapoint along the categorical axis so that datapoints do not overlap:

fig, ax = plt.subplots()
sns.swarmplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])
ax.legend(loc="upper left")

The keyword input argument dodge can be assigned from the default of False to True to separate the datapoints of each hue level along the categorical axis:

fig, ax = plt.subplots()
sns.swarmplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]], dodge=True)
ax.legend(loc="upper left")
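To see the difference between the two categorical scatter plots, they can be drawn side by side. The sketch below assumes synthetic pulse values and passes each AxesSubplot through the ax keyword input argument instead of using plt.sca:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the exercise data, rounded so that values repeat.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "kind": np.repeat(["rest", "walking", "running"], 20),
    "pulse": np.concatenate([
        rng.normal(loc=90, scale=5, size=20),
        rng.normal(loc=95, scale=6, size=20),
        rng.normal(loc=115, scale=12, size=20),
    ]).round(),
})

fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)
# Left: random jitter, so repeated values may overlap.
sns.stripplot(data=df, x="kind", y="pulse", ax=ax[0])
ax[0].set_title("stripplot")
# Right: positions adjusted so that datapoints do not overlap.
sns.swarmplot(data=df, x="kind", y="pulse", ax=ax[1])
ax[1].set_title("swarmplot")
```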

Instead of using a boxplot the data can also be plotted using a bar plot:

fig, ax = plt.subplots()
sns.barplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])

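In such a case, the bars begin at 0 and by default the height of each bar is the mean of the values in that category. The black line at the top of each bar is an error bar, which by default shows a 95 % confidence interval for the mean.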

A point plot will indicate the data shown in the bar plot and associated error as a single point:

fig, ax = plt.subplots()
sns.pointplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])
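The values displayed by the bar plot and point plot can be cross-checked with a pandas groupby. This sketch assumes a tiny hypothetical dataframe and the default estimator, which is the mean:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical values, three per exercise kind.
df = pd.DataFrame({
    "kind": ["rest", "rest", "rest", "running", "running", "running"],
    "pulse": [80, 90, 100, 110, 130, 150],
})

# The mean of each category, computed directly.
means = df.groupby("kind")["pulse"].mean()
print(means["rest"], means["running"])  # 90.0 130.0

# The same means are drawn as the bar heights.
fig, ax = plt.subplots()
sns.barplot(data=df, x="kind", y="pulse", ax=ax)
bar_heights = [patch.get_height() for patch in ax.patches]
```

The error bars are estimated by resampling the data, so their exact extent can vary slightly from run to run.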

A count plot will count the number of times discrete variables occur. For example if x is assigned to "pulse", the y values will be the number of counts of each pulse value for each category:

sns.countplot(data=exercise, x="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])
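The counts shown by a count plot with hue are equivalent to a cross-tabulation. A minimal sketch with pandas, using a few hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: a pulse value and a diet for each observation.
df = pd.DataFrame({
    "pulse": [90, 90, 95, 100, 100, 100],
    "diet": ["low fat", "no fat", "low fat", "low fat", "no fat", "no fat"],
})

# Number of occurrences of each pulse value for each diet.
table = pd.crosstab(index=df["pulse"], columns=df["diet"])
print(table.loc[100, "no fat"])   # 2
print(table.loc[95, "no fat"])    # 0
```

Each row of the table corresponds to one position on the x axis and each column to one hue.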

The catplot is a categorical plot in the form of a FacetGrid:

catplot1 = sns.catplot(data=exercise, x="kind", y="pulse", hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])

The catplot has the keyword input argument kind, which has a default value of "strip" and displays a strip plot. This can be changed to "swarm", "box", "violin", "point", "bar" and "count" to change the plot type to a swarm plot, box plot, violin plot, point plot, bar plot and count plot respectively. The additional keyword input arguments for each plot type are available. For example if the kind is "violin", the keyword input argument split is available:

catplot2 = sns.catplot(data=exercise, x="kind", y="pulse", kind="violin", split=True, hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])

The FacetGrid keyword input arguments row, col and col_wrap are available which will display subplots. We can for example assign col to the categorical series "time":

catplot3 = sns.catplot(data=exercise, x="kind", y="pulse", col="time", kind="violin", split=True, hue="diet", hue_order=["low fat", "no fat"], palette=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]])
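The subplots created by col can be inspected through the returned FacetGrid. This sketch assumes synthetic data with three hypothetical time values and uses kind="box" for brevity:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the exercise data: 10 rows per kind and time.
rng = np.random.default_rng(seed=0)
kinds = ["rest", "walking", "running"]
times = ["1 min", "15 min", "30 min"]
rows = [(k, t) for k in kinds for t in times for _ in range(10)]
df = pd.DataFrame(rows, columns=["kind", "time"])
df["pulse"] = rng.normal(loc=100, scale=10, size=len(df))

g = sns.catplot(data=df, x="kind", y="pulse", col="time", kind="box")

# One row of subplots, one column per unique value in "time".
print(g.axes.shape)  # (1, 3)

# FacetGrid methods apply to every subplot at once.
g.set_axis_labels("exercise kind", "pulse")
```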

Matrix Plots

A number of matrix plots are available to compare numerical series and rows with one another. If we type in:

sns.matrix.

We see a list of these matrix plots such as the heatmap and clustermap. Let's have a look at the flights dataset:

flights = sns.load_dataset(name="flights")

To use matrix plots, the data in the dataframe has to be in the matrix format. This can be done by using the pandas function pivot:

flights_mat = pd.pivot(data=flights, index="month", columns="year", values="passengers")
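What pivot does is easier to see on a few rows. The sketch below uses a small hand-typed dataframe in the same long format as flights; the four rows are illustrative only:

```python
import pandas as pd

# Long format: one row per (year, month) observation.
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Matrix format: one row per month, one column per year.
wide_df = pd.pivot(data=long_df, index="month", columns="year", values="passengers")

print(wide_df.shape)             # (2, 2)
print(wide_df.loc["Jan", 1950])  # 115
```

In this sketch the months are plain strings, so the index is sorted alphabetically rather than in calendar order.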

Notice that the values in the matrix, i.e. the number of passengers, are all related to one another. The view in the Spyder variable explorer uses color-coding so we can visualise the differences in intensity of the data, which in this case is the number of passengers.

We can view this matrix with a sequential colormap using a heatmap.

The heatmap outputs an AxesSubplot:

fig, ax = plt.subplots()
sns.heatmap(data=flights_mat)

We can optionally set the minimum and maximum value in the colormap using the keyword input arguments vmin and vmax:

fig, ax = plt.subplots()
sns.heatmap(data=flights_mat, vmin=0, vmax=1000)

We can change the colormap, using the keyword input argument cmap and setting it for example to "viridis":

fig, ax = plt.subplots()
sns.heatmap(data=flights_mat, cmap="viridis")

Or "bone":

fig, ax = plt.subplots()
sns.heatmap(data=flights_mat, cmap="bone")

We can also specify our own color palette for example:

fig, ax = plt.subplots()
sns.heatmap(data=flights_mat, cmap=sns.blend_palette(colors=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]], n_colors=30))
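The keyword input arguments vmin, vmax and cmap can be combined. The sketch below assumes a small synthetic matrix and uses the as_cmap keyword input argument of blend_palette, which returns a continuous colormap instead of a list of discrete colors; the colors "white" and "navy" are arbitrary choices:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Small synthetic matrix in the same layout as flights_mat.
mat = pd.DataFrame(
    np.arange(12).reshape(3, 4) * 50,
    index=["Jan", "Feb", "Mar"],
    columns=[1949, 1950, 1951, 1952],
)

# Continuous colormap blended between two colors.
cmap = sns.blend_palette(colors=["white", "navy"], as_cmap=True)

fig, ax = plt.subplots()
sns.heatmap(data=mat, vmin=0, vmax=1000, cmap=cmap, ax=ax)

# The color limits of the drawn mesh match vmin and vmax.
mesh = ax.collections[0]
print(mesh.get_clim())
```

Fixing vmin and vmax is useful when several heatmaps should share the same color scale.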

The clustermap is a hierarchically-clustered heatmap. Columns that are more similar to one another are grouped together, and rows that are more similar to one another are likewise grouped together. Unlike the heatmap, the clustermap outputs a ClusterGrid which operates on the Figure level:

clustermap = sns.clustermap(data=flights_mat)

The summer months for example are grouped together at the bottom as this is when there is the highest number of passengers. The years are roughly in order as the overall trend has been an increase in passengers, year on year.

We can also apply a custom colormap to this plot type using the keyword input argument cmap:

clustermap2 = sns.clustermap(data=flights_mat, cmap=sns.blend_palette(colors=[sns.xkcd_rgb["sky blue"], sns.xkcd_rgb["hot pink"]], n_colors=30))
