In this guide we will look at creating a histogram plot of randomly generated data. We first need to load three libraries: numpy, pandas and matplotlib.pyplot. These are conventionally imported as np, pd and plt respectively.
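As a minimal sketch, the three imports with their conventional aliases look like this (it assumes all three libraries are installed, as they are in a standard Anaconda distribution):

```python
import numpy as np               # numerical arrays and random numbers
import pandas as pd              # tabular data (imported here, used in later guides)
import matplotlib.pyplot as plt  # plotting functions such as hist

print(np.__name__, pd.__name__, plt.__name__)
```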
Configuring the Layout of Figures
Before creating any figures, you should adjust your preferences for how figures are displayed. The default option is inline, which means all figures will be printed to the Console as shown:
If instead you want the Figures to be shown as a separate Window, you can change the setting to Automatic. To do this go to Tools → Preferences:
Next, in the left-hand menu, select IPython console:
Change the setting from Inline to Automatic:
Now go to Consoles and restart the kernel:
When you rerun your code, the figure will appear in a separate window as opposed to inline within the Console:
Note: Spyder version 3.3 may give a stream of errors instead of making a plot. If you have this version (installed by default with the Anaconda March 2019 installer), you should close Spyder and then update both Anaconda and Spyder. To do this, open the Anaconda PowerShell Prompt and type in:
Note: it is also possible to toggle between the two settings without restarting the kernel using the following commands:
In these guides, the setting automatic will be applied and the figures will all be shown as separate windows.
To create a new figure we can use the figure function. Leaving the input argument empty creates a new figure with the next available figure number:
To view the figure we need to show it:
If no figures are open this will be “Figure 1”. We can also specify the figure number using:
Now that we have Figure 1000, if we once again type in:
We will get Figure 1000 + 1, i.e. Figure 1001.
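The figure numbering described above can be sketched as follows. The Agg backend line is an assumption added so that the sketch runs without a display; leave it out in Spyder, where the figures open as windows.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

plt.close("all")          # start with no figures open

fig_a = plt.figure()      # no argument: next free number, here Figure 1
fig_b = plt.figure(1000)  # explicit figure number
fig_c = plt.figure()      # one more than the highest open figure number

print(fig_a.number, fig_b.number, fig_c.number)  # 1 1000 1001
# plt.show()  # uncomment in an interactive session to display the figures
```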
The figures can be closed using the x in the top right corner or by using the command close with the figure number as the input argument, in our case 1, 1000 and 1001:
Passing the string 'all' to close instead will close all open figures.
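A short sketch of closing figures by number and then all at once. The Agg backend line is an assumption so that it runs without a display; get_fignums is used here only to report which figures remain open.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

plt.close("all")
for num in (1, 1000, 1001):
    plt.figure(num)       # open Figures 1, 1000 and 1001

plt.close(1)              # close a single figure by its number
plt.close(1000)
still_open = plt.get_fignums()

plt.close("all")          # close every remaining figure
none_open = plt.get_fignums()

print(still_open, none_open)  # [1001] []
```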
Random Number Generators
In order to make this reproducible, we will first set the seed of the random number generator to 0.
Let us now have a look at the random normal function (randn). This returns normally distributed random numbers centred on the origin (mean 0, standard deviation 1). We can create an array of 10 such numbers.
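A minimal sketch of seeding the generator and drawing 10 values with randn. Re-seeding with the same value reproduces exactly the same numbers, which is the point of fixing the seed.

```python
import numpy as np

np.random.seed(0)           # fix the seed so the results are reproducible
a = np.random.randn(10)     # 10 values from the standard normal distribution
print(a)

np.random.seed(0)           # same seed again ...
a_again = np.random.randn(10)
print(np.array_equal(a, a_again))  # ... gives the same 10 values
```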
For a small set of numbers it is fine to view these as a vector; however, to understand the distribution it is more useful to plot them as a histogram:
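As a sketch, the 10 values can be passed straight to hist. The function also returns the bin counts, the bin edges and the drawn bars, which is handy for checking what was plotted. The Agg backend line is an assumption so that this runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
a = np.random.randn(10)

plt.figure()
counts, edges, patches = plt.hist(a)  # default settings: 10 equal-width bins
print(counts)   # number of values falling in each bin
print(edges)    # 11 edges bounding the 10 bins
# plt.show()    # uncomment in an interactive session
```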
Bins and Range
This creates a histogram; however, there is insufficient data to see the shape of the distribution. To rectify this we can increase n. In the example above, Matplotlib determined the lower and upper bounds of the histogram bins itself and used 10 bins by default. We can add the additional arguments bins and range to set these.
Here we can see the shape of a Normal distribution begin to emerge. The data still isn’t good enough to be completely sure.
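A sketch of setting the bins and range explicitly. The values n = 100, 20 bins and a range of -4 to 4 are illustrative assumptions rather than the settings used for the figure above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100                       # assumed sample size
a = np.random.randn(n)

plt.figure()
# bins sets the number of bars, range sets the lower and upper bound
counts, edges, patches = plt.hist(a, bins=20, range=(-4, 4))
print(len(counts), edges[0], edges[-1])
# plt.show()  # uncomment in an interactive session
```

Values that fall outside the chosen range are simply left out of the histogram, so the counts may add up to slightly less than n.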
Colour and Transparency
We can measure three times and overlay the three plots on a single graph. To distinguish the three plots we can use the additional input argument color (US spelling). For the primary and secondary colours, white and black, either the full name (red, green, blue, magenta, yellow, cyan, white or black) or the single-letter abbreviation (r, g, b, m, y, c, w, k) works as a string input argument:
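A sketch of three overlaid histograms, one with a full colour name and two with single-letter abbreviations. The order red, green, blue and the values of n, bins and range are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100
a = np.random.randn(n)
b = np.random.randn(n)
c = np.random.randn(n)

plt.figure()
_, _, bars_a = plt.hist(a, bins=20, range=(-4, 4), color="red")  # full name
_, _, bars_b = plt.hist(b, bins=20, range=(-4, 4), color="g")    # abbreviation
_, _, bars_c = plt.hist(c, bins=20, range=(-4, 4), color="b")    # drawn last, so on top
# plt.show()  # uncomment in an interactive session
```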
Now these figures overlay; however, it is hard to see them because the last plotted histogram (blue) covers the first plotted histogram (green), which in turn covers the zeroth plotted histogram. This can be amended with some transparency:
This data overlays pretty well, so I will shift it so that a is -1 with respect to b and c is +1 with respect to b, letting you see the effects of transparency in more detail:
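A sketch of the shifted data sets plotted with transparency. The alpha value of 0.5 and the bins and range are assumptions; alpha runs from 0 (fully transparent) to 1 (fully opaque).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 100
a = np.random.randn(n) - 1   # shifted by -1 with respect to b
b = np.random.randn(n)
c = np.random.randn(n) + 1   # shifted by +1 with respect to b

plt.figure()
_, _, bars_a = plt.hist(a, bins=20, range=(-5, 5), color="r", alpha=0.5)
_, _, bars_b = plt.hist(b, bins=20, range=(-5, 5), color="g", alpha=0.5)
_, _, bars_c = plt.hist(c, bins=20, range=(-5, 5), color="b", alpha=0.5)
# plt.show()  # uncomment in an interactive session
```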
The single-letter abbreviations only cover the primary colours, secondary colours, white and black. Colours can also be expressed as a set of three numerical values, one for each primary channel: [r,g,b]. Colours are often expressed as 8 bit numeric values ranging from 0 to 255; the colour picker in Microsoft Word, for instance, uses these values:
However, Matplotlib uses floating point numbers between 0 and 1, so we divide the 8 bit values by 255. Sometimes this is combined with the alpha parameter to give a set of 4 values [r,g,b,a], where the 4th value corresponds to the alpha value or transparency.
We can use these numeric values to return red, green and blue and apply the alpha values we had earlier.
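A sketch of the same overlaid chart using numeric colours. Each 8 bit channel value is divided by 255 to give the 0 to 1 floats Matplotlib expects; the data, bins, range and alpha of 0.5 are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

# 8 bit (0-255) channel values scaled to floats between 0 and 1
red = (255 / 255, 0 / 255, 0 / 255)
green = (0 / 255, 255 / 255, 0 / 255)
blue = (0 / 255, 0 / 255, 255 / 255)

np.random.seed(0)
n = 100
a = np.random.randn(n) - 1
b = np.random.randn(n)
c = np.random.randn(n) + 1

plt.figure()
_, _, bars_a = plt.hist(a, bins=20, range=(-5, 5), color=red, alpha=0.5)
_, _, bars_b = plt.hist(b, bins=20, range=(-5, 5), color=green, alpha=0.5)
_, _, bars_c = plt.hist(c, bins=20, range=(-5, 5), color=blue, alpha=0.5)
print(bars_a[0].get_facecolor())  # the colour stored as (r, g, b, a)
# plt.show()  # uncomment in an interactive session
```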
Another colouring system is the hexadecimal colouring system. Here each channel corresponds to 2 characters, each ranging from 0 to F (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F), which gives 16×16=256 levels per channel (0-255 with zero-based indexing). Once again we have 3 RGB channels, or 4 channels if we include the alpha. This reproduces the chart from earlier:
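A sketch using hexadecimal colour strings. The to_rgba helper from matplotlib.colors is used only to show how a hex string maps to floats; the data, bins and range are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

np.random.seed(0)
n = 100
a = np.random.randn(n) - 1
b = np.random.randn(n)
c = np.random.randn(n) + 1

plt.figure()
# "#RRGGBB": two hexadecimal characters per channel
_, _, bars_a = plt.hist(a, bins=20, range=(-5, 5), color="#FF0000", alpha=0.5)
_, _, bars_b = plt.hist(b, bins=20, range=(-5, 5), color="#00FF00", alpha=0.5)
_, _, bars_c = plt.hist(c, bins=20, range=(-5, 5), color="#0000FF", alpha=0.5)

opaque_red = to_rgba("#FF0000")     # three channels, fully opaque
half_red = to_rgba("#FF000080")     # fourth pair is the alpha: 0x80 = 128 of 255
print(opaque_red, half_red)
# plt.show()  # uncomment in an interactive session
```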
In general people just look up the colour values and apply the one they want but for a more detailed explanation behind the fundamentals see
Line Width and Edge Colour
Moving back to a single chart and noting that the statistics aren’t good enough, we’ll increase n 100-fold to 10000. We’ll also add outlines to each bar by specifying the additional arguments edgecolor and linewidth.
Now we can see it resembles a Normal (Gaussian) function, that the edges of each bar are black and thicker, and that the bars are the same colour as the green taken from Microsoft Word.
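A sketch of a single histogram with outlined bars. The green channel values (0, 176, 80) are an assumption standing in for the colour picked in Microsoft Word, and the bins, range and line width are also illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
a = np.random.randn(n)

word_green = (0 / 255, 176 / 255, 80 / 255)  # assumed 8 bit values, scaled to 0-1

plt.figure()
counts, edges, bars = plt.hist(
    a, bins=50, range=(-4, 4),
    color=word_green,     # fill colour of the bars
    edgecolor="black",    # outline colour
    linewidth=2,          # outline thickness in points
)
# plt.show()  # uncomment in an interactive session
```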
We can change the hatching of the bars using the additional input argument hatch.
Toggling through the different hatch styles we get:
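A sketch that loops through the basic hatch patterns, drawing one figure per pattern. White bars with black edges are an assumption chosen so that the hatching is easy to see.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
a = np.random.randn(10000)

# the backslash must be escaped inside a Python string
hatches = ["/", "\\", "|", "-", "+", "x", "o", "O", ".", "*"]

plt.close("all")
applied = []
for h in hatches:
    plt.figure()
    _, _, bars = plt.hist(a, bins=20, range=(-4, 4),
                          color="w", edgecolor="k", hatch=h)
    applied.append(bars[0].get_hatch())  # record the pattern on the first bar
# plt.show()  # uncomment in an interactive session
```

Repeating a character, for example "//", gives a denser version of the same pattern.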
To get a label for a dataset, for example the histogram plot of a, use the additional input argument label and supply your label as a string enclosed in quotation marks. For the x label, y label and title, there are the functions xlabel, ylabel and title, which all take a string as an input argument. We can use the legend function to show the legend and use its input argument loc to specify the location of the legend. The location can be given as a string or as a number. Unfortunately the numerical input is implemented in the following way:
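|Location string||Numeric Value|
|best||0|
|upper right||1|
|upper left||2|
|lower left||3|
|lower right||4|
|right||5|
|center left||6|
|center right||7|
|lower center||8|
|upper center||9|
|center||10|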
This is as opposed to following the layout of a number square (such as a numeric keypad), which would have made much more sense.
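A sketch that puts the label, axis labels, title and legend together. The label and title text are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
a = np.random.randn(n)

plt.figure()
plt.hist(a, bins=50, range=(-4, 4), color="g", edgecolor="k",
         label="a")                       # label shown in the legend
plt.xlabel("value")
plt.ylabel("count")
plt.title("Histogram of 10000 random numbers generated using randn")
legend = plt.legend(loc="upper left")     # the same position as loc=2
ax = plt.gca()
# plt.show()  # uncomment in an interactive session
```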
You’ll notice the title above is quite long. If we want to split it over two lines we can use the special character \n. Note that we have also had to specify the value of n, i.e. 10000, on line 1 and on line 9. We can instead reference the variable n assigned on line 1 by typing %i (i for integer) where we want the integer n to appear. After closing the quotes we put a % followed by the variable we want to reference, enclosed in parentheses. In this case we only specify one variable.
With this done we can change n=1000000 in line 1 and it will autoupdate:
If we wanted to reference more variables, we would put in a marker for each one, %i (integer), %f (float) or %s (string), and then list the variables in order inside the parentheses after the %, separated by commas.
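A sketch of the % formatting described above, first with a single integer and a line break in the title, then with several markers. The wording of the strings is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

n = 10000

# one marker, one variable; \n splits the title over two lines
title = "Histogram of %i random numbers\ngenerated using randn" % (n)

# several markers are filled in order from the values in the parentheses
multi = "%s: n = %i, scale = %f" % ("a", n, 1.0)

plt.figure()
plt.title(title)
ax = plt.gca()
print(title)
print(multi)  # a: n = 10000, scale = 1.000000
```

Changing n on the first line now updates the title without editing the string itself.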
One can also enable gridlines on a plot using the command grid. The first input argument is a Boolean that specifies whether the gridlines are enabled or disabled (it is named b in older Matplotlib releases and visible from version 3.5 onwards). The second input argument, which, specifies which gridlines are amended: the minor, the major or both. The color argument specifies the colour of the gridlines, and one may also change the linestyle, for instance ':' for a dotted line (this will be covered in more detail when line plots are examined).
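A sketch of enabling gridlines. The Boolean is passed positionally so that the call works whichever name the Matplotlib version uses for it. The call to minorticks_on is an addition not mentioned above: minor gridlines are only drawn where minor ticks exist, and these are off by default on a linear axis.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
a = np.random.randn(10000)

plt.figure()
plt.hist(a, bins=50, range=(-4, 4), color="g", edgecolor="k")
plt.minorticks_on()  # needed for the minor gridlines to appear
plt.grid(True, which="both", color="k", linestyle=":")  # dotted major and minor gridlines
ax = plt.gca()
# plt.show()  # uncomment in an interactive session
```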
Suppose we wanted two histograms on the same figure, for instance to compare the uniform random distribution (rand) with the random normal distribution (randn); we could use a subplot. For the subplot, we specify the number of rows and then the number of columns; the third input argument is the position, with position 1 being the first subplot, position 2 the second and so on. The positions start at the top left and go row-wise until the last element is reached:
For comparison this is a 2 by 2 plot.
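A sketch of a 2 row by 1 column layout with randn on top and rand below. The colours, bins and titles are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
a = np.random.randn(n)   # random normal distribution
u = np.random.rand(n)    # uniform distribution between 0 and 1

fig = plt.figure()
plt.subplot(2, 1, 1)     # 2 rows, 1 column, position 1 (top)
plt.hist(a, bins=50, color="g", edgecolor="k")
plt.title("randn")

plt.subplot(2, 1, 2)     # 2 rows, 1 column, position 2 (bottom)
plt.hist(u, bins=50, color="b", edgecolor="k")
plt.title("rand")
# plt.show()  # uncomment in an interactive session
```

For a 2 by 2 layout the calls would be subplot(2, 2, 1) to subplot(2, 2, 4), filling the top row first.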
Obviously you may be limited when it comes to axes, legends, titles etc. when you try to fit them into a smaller screen. So we will remove the titles and the legends and instead only plot a single supertitle using the function suptitle. We will also remove the x-axis of the top histogram as it is shared with the bottom histogram. The second dataset can be seen to be between 0 and 1 and it is clear that there are not enough bins so we will increase these also.
We can see that the random function rand distributes the numbers between 0 and 1 and, with the exception of the end bins, which lie half under 0 and half over 1, gives approximately equal counts per bin. In contrast, the random normal function randn distributes the numbers around the centre in the form of a Gaussian.
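A sketch of the tidied comparison: a single supertitle, no x tick labels on the top histogram, and the same range with more bins on both. The bin count, range and title text are assumptions, so the bin edges will not match the figure described above exactly.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 10000
a = np.random.randn(n)
u = np.random.rand(n)

fig = plt.figure()
sup = plt.suptitle("randn (top) compared with rand (bottom), n = %i" % (n))

plt.subplot(2, 1, 1)
plt.hist(a, bins=80, range=(-4, 4), color="g", edgecolor="k")
plt.tick_params(labelbottom=False)   # hide the x tick labels of the top plot

plt.subplot(2, 1, 2)
counts_u, edges_u, _ = plt.hist(u, bins=80, range=(-4, 4), color="b", edgecolor="k")
plt.xlabel("value")
# plt.show()  # uncomment in an interactive session
```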