In this guide we will look at creating a histogram plot of randomly generated data. We first need to load three libraries: numpy, pandas and matplotlib.pyplot. These are typically imported as np, pd and plt respectively.
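The imports described above would typically look like the following minimal sketch, which assumes all three libraries are already installed (as they are with Anaconda):

```python
import numpy as np               # numerical arrays and random numbers
import pandas as pd              # tabular data, loaded here as the guide suggests
import matplotlib.pyplot as plt  # plotting functions such as hist
```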
Configuring the Layout of Figures
Before creating any figures, you should adjust your preferences for how you wish to display figures. The default option is inline which means all figures will be printed to the Console as shown:
If instead you want the Figures to be shown as a separate Window, you can change the setting to Automatic. To do this go to Tools → Preferences:
Next, in the left-hand menu, select IPython console:
Change the setting from Inline to Automatic:
Now go to Consoles and restart the kernel:
When rerunning your code, your figure will be in a separate window as opposed to being inline within the Console:
Note: Spyder version 3.3 may give a stream of errors instead of making a plot. If you have this version (installed by default with the Anaconda March 2019 installer) you should close down Spyder and then update both Anaconda and Spyder. To do this open the Anaconda PowerShell Prompt and type in:
Note: it is also possible to toggle between the two settings without restarting the kernel using the following commands:
In these guides, the setting automatic will be applied and the figures will all be shown as separate windows.
To create a new figure we can use the following function. Leaving the input argument empty will create a new figure:
To view the figure we need to show it:
If no figures are open this will be "Figure 1". We can also specify the figure number using:
Now that we have Figure 1000, if we once again type in:
We will get Figure 1000 + 1, i.e. Figure 1001.
The figures can be closed using the x in the top right corner or by using the command close with the figure number as the input argument, in our case 1, 1000 and 1001. Using the string 'all' as the input argument instead will close all open figures.
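The figure commands described above can be sketched as follows. The variable names are only illustrative, and plt.show() is left as a comment so the sketch also runs without a display:

```python
import matplotlib.pyplot as plt

plt.close("all")               # start with no figures open

fig_first = plt.figure()       # no argument: next free number, here Figure 1
fig_named = plt.figure(1000)   # explicit figure number
fig_next = plt.figure()        # no argument again: highest open number + 1

open_before = plt.get_fignums()  # numbers of all open figures

# plt.show() would display the open figures when run interactively

plt.close(1)                   # close a single figure by its number
plt.close("all")               # close every figure that is still open
open_after = plt.get_fignums()
```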
Random Number Generators
In order to make this reproducible, we will first reset the seed of the random number generator to 0.
Let us now have a look at the random normal function (randn). This gives random numbers drawn from a normal distribution centred on the origin. We can create an array of 10 such numbers.
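A minimal sketch of seeding the generator and drawing 10 values, using the legacy np.random functions that the text refers to:

```python
import numpy as np

np.random.seed(0)         # fix the seed so every run gives the same numbers
x = np.random.randn(10)   # 10 samples from the standard normal distribution
print(x)
```

Because the seed is fixed, rerunning these lines should print the same 10 values each time.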
For a small set of numbers it is fine to view these as a vector, however to understand the distribution it is more useful to plot them as a histogram:
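Plotting these 10 values as a histogram might look like the sketch below. Capturing the return values of hist is optional and is done here only to show what the function reports back:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.randn(10)

plt.figure()
# hist returns the bar heights, the bin edges and the drawn bars
counts, edges, bars = plt.hist(x)   # default: 10 equal-width bins spanning the data
```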
Bins and Range
This creates a histogram, however there is insufficient data to see how the distribution works. To rectify this we can increase n. In the example above Matplotlib has determined the lower and upper bounds of the histogram from the data and used 10 bins by default. We can add the additional arguments bins and range to set these ourselves.
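As a sketch, increasing n and fixing the bins explicitly could look like this. The values n = 100, 20 bins and a range of -5 to 5 are assumptions chosen for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100                      # assumed sample size for this step
x = np.random.randn(n)

plt.figure()
# 20 equal-width bins between -5 and 5 (both values are illustrative)
counts, edges, bars = plt.hist(x, bins=20, range=(-5, 5))
```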
Here we can see the shape of a normal distribution begin to take shape. There is still not enough data to be completely sure.
Colour and Transparency
We can measure three times and overlay the three plots on a single graph. To distinguish the three plots we can use the additional input argument color (US spelling). Primary and secondary colours, as well as black and white, can be specified using either a single-letter string or a string containing the full name of the colour.
|Single Letter|Full Name|
|r|red|
|g|green|
|b|blue|
|c|cyan|
|m|magenta|
|y|yellow|
|k|black|
|w|white|
For more fine tuning, colours can be specified as a vector of [r,g,b] values. Many programs list these [r,g,b] values between 0 and 255, but Matplotlib expects normalised values between 0 and 1, so each value must be divided by 255. For instance the standard colours in Microsoft Word are as follows. Using these colours may be useful if you want to keep plots consistent with a Word document.
|Microsoft Word RGB||Hex|
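Converting a 0 to 255 triple into the normalised form can be sketched as below. The triple (0, 176, 80) is an assumed example value for a Word green; substitute whichever values you look up:

```python
from matplotlib import colors

green_255 = (0, 176, 80)                       # assumed example on the 0-255 scale
green_norm = tuple(v / 255 for v in green_255) # normalised to the 0-1 scale
green_hex = colors.to_hex(green_norm)          # equivalent hex string

print(green_norm)
print(green_hex)
```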
Now these figures overlay, however it is hard to see them because the last plotted histogram (blue) covers the first plotted histogram (green), which in turn covers the zeroth plotted histogram. This can be amended with some transparency:
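A sketch of three overlaid histograms with transparency follows. The sample size, bin settings and the alpha value of 0.5 are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100
a = np.random.randn(n)   # zeroth dataset
b = np.random.randn(n)   # first dataset
c = np.random.randn(n)   # last dataset

plt.figure()
# alpha=0.5 makes each set of bars half transparent so the others show through
_, _, bars_a = plt.hist(a, bins=20, range=(-5, 5), color="r", alpha=0.5)
_, _, bars_b = plt.hist(b, bins=20, range=(-5, 5), color="g", alpha=0.5)
_, _, bars_c = plt.hist(c, bins=20, range=(-5, 5), color="b", alpha=0.5)
```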
This data overlays pretty well. I will modify it so that a is shifted by -1 with respect to b and c is shifted by +1 with respect to b, so you can see the effects of transparency in more detail:
Sometimes [r,g,b] is combined with the alpha parameter to give a set of 4 values [r,g,b,a] where the 4th value corresponds to the alpha value or transparency. We can use these numeric values to return red, green and blue and apply the alpha values we had earlier.
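The [r,g,b,a] form can be sketched as below, with the three datasets shifted apart as described earlier. The alpha of 0.5 and the bin settings are assumed values:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100
a = np.random.randn(n) - 1   # shifted by -1 with respect to b
b = np.random.randn(n)
c = np.random.randn(n) + 1   # shifted by +1 with respect to b

plt.figure()
# each colour is (r, g, b, alpha) with every value between 0 and 1
_, _, bars_a = plt.hist(a, bins=20, range=(-5, 5), color=(1, 0, 0, 0.5))
_, _, bars_b = plt.hist(b, bins=20, range=(-5, 5), color=(0, 1, 0, 0.5))
_, _, bars_c = plt.hist(c, bins=20, range=(-5, 5), color=(0, 0, 1, 0.5))
```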
Another colouring system is the hexadecimal or hex colouring system. In this system each channel corresponds to 2 characters, each ranging from 0 to F (0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F), giving 16×16=256 levels per channel (0-255 with zero indexing). Once again we have 3 rgb channels, or 4 channels if we include the alpha. This reproduces the chart from earlier:
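A sketch of the same chart using 8-character hex strings, where the last two characters are the alpha channel. The value 80 (128 in decimal, roughly 0.5) is an assumed choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors

np.random.seed(0)
n = 100
a = np.random.randn(n) - 1
b = np.random.randn(n)
c = np.random.randn(n) + 1

red_half = "#FF000080"             # FF red, 00 green, 00 blue, 80 alpha
rgba = colors.to_rgba(red_half)    # the same colour as normalised (r, g, b, a)

plt.figure()
plt.hist(a, bins=20, range=(-5, 5), color=red_half)
plt.hist(b, bins=20, range=(-5, 5), color="#00FF0080")
plt.hist(c, bins=20, range=(-5, 5), color="#0000FF80")
```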
In general people just look up the colour values and apply the one they want but for a more detailed explanation behind the fundamentals see
Line Width and Edge Colour
Moving back to a single chart and noting that the statistics aren't good enough, we'll increase n 100-fold to 10000. We'll also look at adding outlines to each bar by specifying the additional arguments edgecolor and linewidth.
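A sketch with outlined bars follows. The green triple and the linewidth of 2 are assumed values for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
x = np.random.randn(n)

green = (0, 176 / 255, 80 / 255)   # assumed Word-style green, normalised to 0-1

plt.figure()
counts, edges, bars = plt.hist(x, bins=20, range=(-5, 5), color=green,
                               edgecolor="k",   # black outline around each bar
                               linewidth=2)     # outline thickness in points
```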
Now we can see that it resembles a Gaussian (normal) distribution, that the outline of each bar is black and thickened, and that the bars are the same green as the one taken from Microsoft Word.
We can change the hatching of the bars using the additional input argument hatch.
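A sketch of the hatch argument, using a diagonal pattern as an assumed example:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.randn(10000)

plt.figure()
# other hatch strings include "\\", "|", "-", "+", "x", "o", "O", "." and "*"
counts, edges, bars = plt.hist(x, bins=20, range=(-5, 5), color="w",
                               edgecolor="k", hatch="/")
```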
Toggling through the different hatch styles we get:
Axis Labels, Title and Legend
The axes can be labelled using xlabel and ylabel, and the plot can be given a title using title. Each plotted dataset can also be assigned a label, set as a string using the input argument label. This label will show up in the legend if a legend is added to the plot using legend.
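The labelling commands can be sketched as follows. All of the label and title strings are placeholders rather than the original wording:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
x = np.random.randn(n)

plt.figure()
plt.hist(x, bins=20, range=(-5, 5), label="randn")   # label used by the legend
plt.xlabel("Value")
plt.ylabel("Count")
plt.title("Histogram of 10000 Normally Distributed Random Numbers")
legend = plt.legend()                                # shows the label set above
ax = plt.gca()
```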
The location of the legend can be set using the input argument loc and assigning it to a string or a number. Note once again that US English is used, so it is center as opposed to the UK English centre. Unfortunately the numerical input is implemented in the following way:
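|Location String|Location Integer|
|best|0|
|upper right|1|
|upper left|2|
|lower left|3|
|lower right|4|
|right|5|
|center left|6|
|center right|7|
|lower center|8|
|upper center|9|
|center|10|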
This is as opposed to following the layout of the number square, which would have made much more sense.
Here we explicitly set the location of the legend.
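Setting the legend location explicitly might look like the sketch below, where "upper left" is an arbitrary choice:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.randn(10000)

plt.figure()
plt.hist(x, bins=20, range=(-5, 5), label="randn")
legend = plt.legend(loc="upper left")   # the integer form would be loc=2
```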
You'll notice the title above is quite long. If we want to separate it onto two lines we can use the special character \n. Note that we have also had to specify the value of n, i.e. 10000, twice: once where n is assigned and once in the title. We can instead reference the variable n in the title by typing in %i (i for integer) where we want the integer n to appear. After we close the quotes we then put a % followed by the variable we want to reference, enclosed in brackets. In this case we only specify one variable.
With this done we can change n to 1000000 where it is assigned and the title will update automatically:
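The %i substitution and the \n line break can be sketched with plain strings. The title wording is a placeholder:

```python
n = 10000

# %i is replaced by the integer after the closing quote; \n starts a new line
title = "Histogram of %i Random Numbers\nDrawn from a Normal Distribution" % (n)
print(title)
```

The resulting string can then be passed to title in place of a hard-coded one.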
If we wanted to reference more variables, we would put in a marker for each one, %i (integer), %f (float) or %s (string), and then list the variables in the same order, separated by commas, in the %( ) at the end.
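A sketch with several markers in one string follows. The variable names and values are made up for illustration:

```python
n = 10000        # integer, matched by %i
name = "randn"   # string, matched by %s
bins = 20        # integer, matched by %i
width = 0.5      # float, matched by %f (six decimal places by default)

# the variables are listed in the same order as the markers
title = "%i samples from %s\n%i bins of width %f" % (n, name, bins, width)
print(title)
```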
One can also enable gridlines on a plot using the command grid. The first input argument is a Boolean that specifies whether the gridlines are enabled or disabled (it is named b in older versions of Matplotlib and visible in newer ones). The second input argument, which, specifies which gridlines are to be amended: the minor, the major or both. The color argument specifies the colour of the gridlines and one may also change the linestyle, for instance : for a dotted line (this will be covered in more detail when line plots are examined).
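A sketch of dotted gridlines on a histogram is shown below. The Boolean is passed positionally so that the same call works whether the argument is named b or visible, and the grey colour is an assumed choice:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.randn(10000)

plt.figure()
plt.hist(x, bins=20, range=(-5, 5))
plt.minorticks_on()   # minor gridlines are only drawn where minor ticks exist
plt.grid(True, which="both", color="0.7", linestyle=":")
ax = plt.gca()
gridline = ax.xaxis.get_gridlines()[0]   # first major gridline on the x-axis
```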
Supposing we wanted two histograms on the same figure, for instance if we are interested in comparing the random distribution to the random normal distribution, we could use a subplot. For the subplot, we specify the number of rows and then the number of columns; the third input argument is the position, with position 1 being the first subplot, position 2 the second and so on. The positions start at the top left and go row-wise until the last element is reached:
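A 2 row by 1 column layout comparing randn with rand can be sketched as below. The bin settings and titles are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
normal = np.random.randn(n)   # normally distributed around 0
uniform = np.random.rand(n)   # uniformly distributed between 0 and 1

fig = plt.figure()
plt.subplot(2, 1, 1)          # 2 rows, 1 column, position 1 (top)
plt.hist(normal, bins=20, range=(-5, 5))
plt.title("randn")

plt.subplot(2, 1, 2)          # 2 rows, 1 column, position 2 (bottom)
plt.hist(uniform, bins=20, range=(-5, 5))
plt.title("rand")
```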
For comparison this is a 2 by 2 plot.
Obviously you may be limited when it comes to axes, legends, titles etc. when you try to fit them into a smaller screen. So we will remove the titles and the legends and instead only plot a single supertitle using the function suptitle. We will also remove the x-axis of the top histogram as it is shared with the bottom histogram. The second dataset can be seen to be between 0 and 1 and it is clear that there are not enough bins so we will increase these also.
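The tidied-up version can be sketched as follows. The bin count of 100, the label strings and the use of tick_params to hide the top tick labels are assumptions about one way to do it:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
normal = np.random.randn(n)
uniform = np.random.rand(n)

fig = plt.figure()
plt.subplot(2, 1, 1)
counts_n, _, _ = plt.hist(normal, bins=100, range=(-5, 5))
plt.tick_params(labelbottom=False)   # hide the x tick labels of the top plot
plt.ylabel("randn")

plt.subplot(2, 1, 2)
counts_u, _, _ = plt.hist(uniform, bins=100, range=(-5, 5))
plt.xlabel("Value")
plt.ylabel("rand")

# one title for the whole figure instead of one per subplot
plt.suptitle("randn and rand compared, n = %i" % (n))
```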
We can see that the random function rand distributes the numbers between 0 and 1 and, with the exception of the end bins which go half under 0 and half over 1, gives approximately equal values per bin. The random normal function randn, by contrast, distributes the numbers around the centre in the form of a Gaussian.