Python and MatPlotLib: Histogram Plotting

Prerequisite Libraries

In this guide we will create a histogram plot of randomly generated data. We first need to load three libraries: numpy, pandas and matplotlib.pyplot. These are typically imported as np, pd and plt respectively.

Python
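The imports described above would typically look like the following. This is a minimal sketch which assumes all three libraries are installed, as they are in a standard Anaconda distribution:

```python
# Conventional aliases for the three libraries used in this guide
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```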

Configuring the Layout of Figures

Before creating any figures, you should adjust your preferences for how figures are displayed. The default option is Inline, which means all figures are displayed within the Console as shown:

If instead you want the Figures to be shown as a separate Window, you can change the setting to Automatic. To do this go to Tools → Preferences:

Next, in the left-hand menu, select IPython console:

Select Graphics:

Change the setting from Inline to Automatic:

Select Apply:

Now go to Consoles and restart the kernel:

When you rerun your code, the figure will appear in a separate window rather than inline within the Console:

Note: Spyder version 3.3 may give a stream of errors instead of making a plot. If you have this version (installed by default with the Anaconda March 2019 installer) you should close down Spyder and then update both Anaconda and Spyder. To do this open the Anaconda PowerShell Prompt and type in:

Python

Note: it is also possible to toggle between the two settings without restarting the kernel, using the following commands:

Python

In these guides, the Automatic setting will be used and all figures will be shown in separate windows.

Function figure

To create a new figure we can use the following function. Leaving the input argument empty will create a new figure:

Python

To view the figure we need to show it:

Python

If no figures are open this will be “Figure 1”. We can also specify the figure number using:

Python

Now that we have Figure 1000, if we once again type in:

Python

We will get Figure 1000 + 1, i.e. Figure 1001.
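Putting the figure commands above together, a minimal sketch might look like this. The variable names are illustrative only, and plt.show() is left out so the snippet also runs without a display; call it afterwards to view the figures:

```python
import matplotlib.pyplot as plt

plt.close('all')                 # make sure no figures are open to begin with

fig_first = plt.figure()         # no argument: next free number, here Figure 1
fig_numbered = plt.figure(1000)  # explicit figure number
fig_next = plt.figure()          # highest open number plus one, here Figure 1001

print(fig_first.number, fig_numbered.number, fig_next.number)
```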

The figures can be closed using the x in the top right corner or by using the command close with the figure number as the input argument, in our case 1, 1000 and 1001:

Python

The command:

Python

Will close all open figures.
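As an illustration of both forms of close, the sketch below opens the three figures from the example, closes one by number and then closes the rest. The helper plt.get_fignums is used here only to inspect which figures remain open:

```python
import matplotlib.pyplot as plt

plt.close('all')                 # start with no open figures
for number in (1, 1000, 1001):
    plt.figure(number)           # open Figures 1, 1000 and 1001

plt.close(1000)                  # close a single figure by its number
remaining = plt.get_fignums()    # numbers of the figures still open
print(remaining)

plt.close('all')                 # close every remaining figure
```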

Random Number Generators

In order to make this reproducible, we will first reset the seed of the random number generator to 0.

Python

Let us now have a look at the random normal function (randn). It returns numbers drawn from a standard normal distribution, centred on the origin (mean 0, standard deviation 1). We can create an array of 10 such numbers.

Python
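A minimal sketch of seeding the generator and drawing the 10 numbers, assuming the array is called a as in the rest of this guide:

```python
import numpy as np

np.random.seed(0)        # reset the seed so every run draws the same numbers
a = np.random.randn(10)  # 10 samples from the standard normal distribution
print(a)
```

Because the seed is fixed, running these lines again reproduces exactly the same array.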

For a small set of numbers it is fine to view these as a vector; however, to understand the distribution it is more useful to plot them as a histogram.

Histogram Plot

The function hist plots the data set as a histogram:

Python
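A sketch of the basic call, assuming the same seeded array a of 10 numbers. The function hist also returns the bin counts, the bin edges and the drawn bars, which is handy for checking what was plotted:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
a = np.random.randn(10)

plt.figure()
# With no extra arguments, hist uses 10 equal-width bins spanning min(a) to max(a)
counts, bin_edges, patches = plt.hist(a)
print(counts)
print(bin_edges)
```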

Bins and Range

This creates a histogram; however, there is insufficient data to see the shape of the distribution. To rectify this we can increase n, the number of samples. In the example above MatPlotLib chose the lower and upper bounds of the histogram bins itself and used 10 bins by default. We can add the additional arguments bins and range to set these.

Python
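A sketch of setting the bins and the range explicitly. The values n = 100, 20 bins and a range of -4 to 4 are assumptions chosen for illustration, not necessarily the values used for the original figure:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100                  # assumed sample size
a = np.random.randn(n)

plt.figure()
# bins sets the number of bars; range sets the lower and upper bound of the bins
counts, bin_edges, patches = plt.hist(a, bins=20, range=(-4, 4))
```

Values falling outside the range are ignored, so the counts may add up to less than n.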

Here we can see the shape of a Normal distribution beginning to emerge. There is still not enough data to be completely sure.

Colour and Transparency

We can measure three times and overlay the three plots on a single graph. To distinguish the three plots we can use the additional input argument color (US spelling). For the primary colours, the secondary colours, white and black, either the full name (red, green, blue, magenta, yellow, cyan, white or black) or the single-letter abbreviation (r, g, b, m, y, c, w, k) works as a string input argument:

Python
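A sketch of three overlaid histograms, each given its own colour. The names a, b and c, the sample size and the bin settings are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100                      # assumed sample size
a = np.random.randn(n)
b = np.random.randn(n)
c = np.random.randn(n)

plt.figure()
plt.hist(a, bins=20, range=(-4, 4), color='r')      # single-letter abbreviation
plt.hist(b, bins=20, range=(-4, 4), color='green')  # full colour name
plt.hist(c, bins=20, range=(-4, 4), color='b')
ax = plt.gca()               # the axes now hold the bars of all three histograms
```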

Now these histograms overlay; however, it is hard to see them because the last plotted histogram (blue) covers the first plotted histogram (green), which in turn covers the zeroth plotted histogram. This can be amended with some transparency, set using the input argument alpha:

Python

This data overlays pretty well. I will shift it so that a is offset by -1 with respect to b and c is offset by +1 with respect to b, so you can see the effects of transparency in more detail:

Python
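A sketch of the shifted data sets drawn with transparency. The alpha value of 0.5 and the bin settings are assumptions; alpha runs from 0 (fully transparent) to 1 (fully opaque):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100                          # assumed sample size
a = np.random.randn(n) - 1       # shifted by -1 with respect to b
b = np.random.randn(n)
c = np.random.randn(n) + 1       # shifted by +1 with respect to b

plt.figure()
_, _, bars_a = plt.hist(a, bins=24, range=(-6, 6), color='r', alpha=0.5)
_, _, bars_b = plt.hist(b, bins=24, range=(-6, 6), color='g', alpha=0.5)
_, _, bars_c = plt.hist(c, bins=24, range=(-6, 6), color='b', alpha=0.5)
```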

The single-letter abbreviations only cover the primary colours, secondary colours, white and black (MatPlotLib also accepts many other colour names, but a numeric specification gives full control). Colours can also be expressed as a set of three numerical values, one for each primary channel, [r,g,b]. Colours are often expressed as 8 bit numeric values ranging from 0 to 255; the colour picker in Microsoft Word, for instance, uses these values:

However in the case of MatPlotLib, it uses floating point numbers between 0 and 1. Thus we can divide the 8 bit values by 255. Sometimes this is combined with the alpha parameter to give a set of 4 values [r,g,b,a] where the 4th value corresponds to the alpha value or transparency.

We can use these numeric values to return red, green and blue and apply the alpha values we had earlier.

Python
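A sketch of the conversion from 8 bit values to the floating point values MatPlotLib expects. The helper rgb8_to_float is a hypothetical name introduced here for illustration, and the alpha of 0.5 is an assumed value:

```python
import numpy as np
import matplotlib.pyplot as plt

def rgb8_to_float(r, g, b):
    """Convert 8 bit channel values (0 to 255) to floats between 0 and 1."""
    return (r / 255, g / 255, b / 255)

red = rgb8_to_float(255, 0, 0)
green = rgb8_to_float(0, 255, 0)
blue = rgb8_to_float(0, 0, 255)

np.random.seed(0)
a = np.random.randn(100)

plt.figure()
# The colour is passed as an (r, g, b) tuple, with the transparency given separately
_, _, bars = plt.hist(a, bins=20, range=(-4, 4), color=red, alpha=0.5)
```

The same transparency could instead be folded into the colour as a fourth value, for example (1.0, 0.0, 0.0, 0.5).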

Another colouring system is the hexadecimal system. In this system each channel corresponds to 2 characters, each ranging from 0 to F (0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F), which gives 16×16=256 levels per channel (0-255 with 0 indexing). Once again we have 3 rgb channels, or 4 channels if we include the alpha. This reproduces the chart from earlier:

Python
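A sketch using hexadecimal colour strings. The function to_rgba from matplotlib.colors is used here only to show how a hex string maps onto the floating point values; the data and bin settings are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

np.random.seed(0)
a = np.random.randn(100)

plt.figure()
# '#RRGGBB': two hexadecimal characters per channel
plt.hist(a, bins=20, range=(-4, 4), color='#FF0000', alpha=0.5)

# '#RRGGBBAA': a fourth pair of characters carries the alpha value
red_opaque = to_rgba('#FF0000')
blue_half = to_rgba('#0000FF80')
print(red_opaque, blue_half)
```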

In general people just look up the colour values and apply the one they want but for a more detailed explanation behind the fundamentals see

Line Width and Edge Colour

Moving back to a single chart and noting that the statistics aren’t good enough, we’ll increase n 100-fold to 10000. We’ll also add outlines to each bar by specifying the additional arguments edgecolor and linewidth.

Python
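A sketch with outlined bars. The green (0, 176, 80) is an assumed 8 bit value standing in for the green picked in Microsoft Word, and the line width of 2 and the bin settings are also assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
a = np.random.randn(n)

word_green = (0 / 255, 176 / 255, 80 / 255)  # assumed 8 bit value divided by 255

plt.figure()
# edgecolor sets the outline colour of each bar, linewidth its thickness
_, _, bars = plt.hist(a, bins=40, range=(-4, 4), color=word_green,
                      edgecolor='k', linewidth=2)
```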

Now we can see that it resembles a Normal (Gaussian) distribution, that the outline of each bar is black and thicker, and that the bars are the same colour as the green taken from Microsoft Word.

Pattern

We can change the hatching of the bars using the additional input argument hatch:

Python
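A sketch that draws one small histogram per hatch style so the patterns can be compared side by side. The 2 by 5 grid, the sample size and the white bar colour are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
a = np.random.randn(1000)

hatch_styles = ['/', '\\', '|', '-', '+', 'x', 'o', 'O', '.', '*']

fig = plt.figure()
for position, style in enumerate(hatch_styles, start=1):
    plt.subplot(2, 5, position)          # one panel per hatch style
    _, _, bars = plt.hist(a, bins=10, range=(-4, 4), color='w',
                          edgecolor='k', hatch=style)
    plt.title(repr(style))               # show the hatch string above its panel
```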

Toggling through the different hatch styles we get:

Labelling

To get a label for a dataset, for example the histogram plot of a, use the additional input argument label and input your label as a string enclosed in quotation marks. For the x label, y label and title, there are functions xlabel, ylabel and title which all take a string as an input argument. We can use the legend function to show the legend and use the input argument loc to specify the location of the legend. The location can be given as a string or as a number. Unfortunately, the numerical input is implemented in the following way:

Location string (US spelling)    Numeric value
'best'                           0
'upper right'                    1
'upper left'                     2
'lower left'                     3
'lower right'                    4
'right'                          5
'center left'                    6
'center right'                   7
'lower center'                   8
'upper center'                   9
'center'                         10

This is opposed to following the layout of a number square, which would have made much more sense.

Python
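A sketch of a labelled histogram with a legend. The label, axis and title strings are placeholders chosen for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

n = 10000
np.random.seed(0)
a = np.random.randn(n)

plt.figure()
plt.hist(a, bins=40, range=(-4, 4), label='a')  # label is the legend entry
plt.xlabel('value')
plt.ylabel('counts')
plt.title('Histogram of 10000 normally distributed random numbers')
plt.legend(loc='upper right')                   # loc=1 is the numeric equivalent
ax = plt.gca()
```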

You’ll notice the title above is quite long. If we want to split it over two lines we can use the special character \n. Note that we have also had to specify the value of n, i.e. 10000, on line 1 and on line 9. We can instead reference the variable n we assigned on line 1 by typing in %i (i for integer) where we want the integer n to appear. After we close the quotes we then put a % followed by the variable we want to reference. In this case we only specify one variable.

Python
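A sketch of a two-line title that picks up the value of n through the %i marker. The title wording is a placeholder:

```python
import numpy as np
import matplotlib.pyplot as plt

n = 10000
np.random.seed(0)
a = np.random.randn(n)

# \n starts a new line; %i is replaced by the integer given after the % sign
title_text = 'Histogram of normally distributed random numbers\nn = %i' % n

plt.figure()
plt.hist(a, bins=40, range=(-4, 4), label='a')
plt.title(title_text)
ax = plt.gca()
```

Changing the assignment of n is then enough to change both the data and the title.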

With this done we can change n=1000000 in line 1 and the title will update automatically:

Python

If we wanted to reference more variables we would need to put in a marker for each of them, %i (integer), %f (float) or %s (string), and then at the end supply the variables, in the same order, inside the %( ).

Python
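A sketch with three markers filled from three variables. The variable names and values here are hypothetical and only illustrate the syntax:

```python
name = 'a'          # filled in at %s (string)
n = 10000           # filled in at %i (integer)
bin_width = 0.2     # filled in at %f (float, shown with six decimal places)

# The variables are supplied in order, inside parentheses after the % sign
label_text = '%s, n = %i, bin width = %f' % (name, n, bin_width)
print(label_text)   # a, n = 10000, bin width = 0.200000
```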

Grid Lines

One can also enable gridlines on the plot using the command grid. The first input argument, b (named visible from MatPlotLib 3.5 onwards), is a Boolean that specifies whether the gridlines are enabled or disabled. The second input argument, which, specifies which gridlines are to be amended: 'major', 'minor' or 'both'. The color argument specifies the colour of the gridlines and one may also change the linestyle, for instance ':' for a dotted line (this will be covered in more detail when line plots are examined).

Python
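A sketch of dotted black gridlines on both the major and minor ticks. The Boolean is passed by position so the call does not depend on whether the argument is named b or visible; the call to minorticks_on is an addition made here because minor gridlines only appear once minor ticks are enabled:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
a = np.random.randn(10000)

plt.figure()
plt.hist(a, bins=40, range=(-4, 4))
plt.minorticks_on()                                     # enable minor ticks
plt.grid(True, which='both', color='k', linestyle=':')  # dotted black gridlines
ax = plt.gca()
```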

Subplot

Supposing we wanted two histograms on the same figure, for instance if we are interested in comparing the uniform random distribution (rand) to the random normal distribution (randn), we could use a subplot. For the subplot, we specify the number of rows and then the number of columns, and the third input argument is the position, with position 1 being the first subplot, position 2 the second and so on. The positions start at the top left and go row-wise until the last element is reached:

Python
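A sketch of two histograms stacked in a 2 row by 1 column layout. The names a and b, the sample size and the bin settings are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

n = 10000
np.random.seed(0)
a = np.random.randn(n)   # normally distributed
b = np.random.rand(n)    # uniformly distributed between 0 and 1

fig = plt.figure()
plt.subplot(2, 1, 1)     # 2 rows, 1 column, position 1 (top)
plt.hist(a, bins=40, range=(-4, 4), label='randn')
plt.legend()
plt.subplot(2, 1, 2)     # 2 rows, 1 column, position 2 (bottom)
plt.hist(b, bins=40, range=(-4, 4), label='rand')
plt.legend()
```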

For comparison this is a 2 by 2 plot.

Python

Obviously you may be limited when it comes to axes, legends, titles etc. when you try to fit them into a smaller space. So we will remove the titles and the legends and instead plot only a single supertitle using the function suptitle. We will also remove the x-axis of the top histogram as it is shared with the bottom histogram. The second dataset can be seen to lie between 0 and 1 and it is clear that there are not enough bins, so we will increase these also.

Python
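A sketch of the tidied-up figure with a single supertitle and no x-axis ticks on the top panel. Removing the ticks with plt.xticks([]) is one possible way of hiding the axis, and the bin settings and title wording are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

n = 10000
np.random.seed(0)
a = np.random.randn(n)
b = np.random.rand(n)

fig = plt.figure()
ax_top = plt.subplot(2, 1, 1)
plt.hist(a, bins=80, range=(-4, 4))
plt.xticks([])                        # hide the x-axis ticks of the top panel
ax_bottom = plt.subplot(2, 1, 2)
plt.hist(b, bins=80, range=(-4, 4))   # same range, so the x-axis is shared
plt.suptitle('randn (top) and rand (bottom), n = %i' % n)
```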

We can see that the random function rand distributes the numbers uniformly between 0 and 1 and, with the exception of the end bins, which lie half under 0 and half over 1, gives approximately equal counts per bin. In contrast, the function randn distributes the numbers around the centre in the form of a Gaussian.
