Table of contents

- Prerequisites
- Machine Learning
- The SciKit Library
- The Iris Example Bunch (4 Features and Target Classification)
- Converting the Bunch into a DataFrame and Saving to an Excel File
- Importing the Excel File as a DataFrame
- Data Visualization
- The Label Encoder Transformer Class
- The Standard Scaler Transformer Class
- The Principal Component Analysis Transformer Class
- The Isomap Transformer Class
- The K Nearest Neighbors Classifier Estimator Class
- The Curse of Dimensionality
- The Naive Bayes Classifier Estimator Class
- The Decision Tree Classifier Estimator Class
- The Random Forest Classifier Model of Estimators Class
- The Support Vector Classifier Estimator Class
- The Logistic Regression Predictor Class

- Converting the Bunch into a DataFrame and Saving to an Excel File
- The Wine Example Bunch (13 Features and Target Classification)
- The Principal Component Analysis Transformer Class
- The Isomap Transformer Class
- The K Nearest Neighbors Classifier Estimator Class
- The Naive Bayes Classifier Estimator Class
- The Random Forest Classifier Model of Estimator Class
- The Support Vector Classifier Estimator Class
- The Logistic Regression Predictor Class

- The Principal Component Analysis Transformer Class
- The Digits Example Bunch (64 Features and Target Classification)
- The Principal Component Analysis Transformer Class
- The Isomap Transformer Class
- The K Nearest Neighbors Classifier Estimator Class
- The Naive Bayes Classifier Estimator Class
- The Random Forest Classifier Model of Estimator Class
- The Support Vector Classifier Estimator Class
- The Logistic Regression Predictor Class

- The Principal Component Analysis Transformer Class
- The Breast Cancer Example Bunch (30 Features and Binary Target Classification)
- The Principal Component Analysis Transformer Class
- The Isomap Transformer Class
- The K Nearest Neighbors Classifier Estimator Class
- The Naive Bayes Classifier Estimator Class
- The Random Forest Classifier Model of Estimator Class
- The Support Vector Classifier Estimator Class
- The Logistic Regression Predictor Class

- The Principal Component Analysis Transformer Class
- The Diabetes Example Bunch (10 Features and Target Regression)
- The Boston Example Bunch (13 Features and Target Regression)

## Prerequisites

Before looking at the SciKit library sklearn, you should make sure you are comfortable using the core Python library as well as numpy, pandas and matplotlib.

## Machine Learning

## The SciKit Library

### Data General Form

When using the sklearn library the data being investigated is usually initially in the form of a dataframe.

The dataframe has the following general form. It consists of a number of columns. All the columns on the left hand side are known as features (which can be thought of as independent variables). The last column is known as the target (dependent variable).

Each row or index corresponds to an observation.

We can therefore take a single row as being the results of an experiment where the values in fea1 and fea2 are the measurement of some independent variables and the target is measured in response.

    import pandas as pd

    data=[[1,2,3],[1,2,3],[1,2,3]]
    df=pd.DataFrame(data,columns=['fea1','fea2','target'])

For machine learning we tend to create a matrix of the features X and a vector of the target y. Capitalization denotes a matrix and lower case denotes a vector, so if there is only a single feature it is written as a lower case x. These need to be numpy arrays.

To get X we can use.

X=df.drop('target',axis=1).to_numpy()

To get y we can use.

y=df['target'].to_numpy()
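As a minimal sketch, the two steps above can be run together on a small dataframe. The values below are made up for illustration (they differ from the dataframe above so that each row is distinguishable); only the column names follow the text.

```python
import pandas as pd

# Hypothetical toy data: two feature columns followed by one target column.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(data, columns=['fea1', 'fea2', 'target'])

# Feature matrix X: every column except the target.
X = df.drop('target', axis=1).to_numpy()

# Target vector y: the target column only.
y = df['target'].to_numpy()

print(X.shape)  # one row per observation, one column per feature
print(y.shape)  # one value per observation
```

X is two-dimensional (observations by features) while y is one-dimensional, which is the shape most sklearn methods expect.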

### Modules

The sklearn library is arranged into a series of modules, and each module contains a series of classes and functions. These need to be imported in order to be used. To view the modules we can type:

    from sklearn import

followed by a space and then a tab ↹.

In this list we see the following modules metrics, preprocessing, model_selection, linear_model, ensemble, svm, feature_extraction, utils, datasets, decomposition, neighbors, tree, naive_bayes, externals, cluster, feature_selection, pipeline, base, neural_network, manifold.

Instead of importing a module directly we tend to import a class or function from a module. Recall that we can select a module from the library by typing the library name followed by a dot . and then the module name.

Note that the library and the module are both in lower case.

### Using the sklearn library

The general procedure for using sklearn is.

#### Step 1: Import the Class from the Module

Import the class or function you wish to use from the appropriate module. This has the general form.

    from sklearn.module import CustomClass
    from sklearn.module import custom_function

If we take the module neighbors and type in:

from sklearn.neighbors

followed by a space and then a tab ↹.

We will see classes (blue icon), functions (orange icon) and modules (yellow icon). Note the difference in naming, classes use CamelCaseCapitalization and functions use lower case with separation using an underscore.

In many Python script files all imports are placed at the top of the script file so it is easy to identify all the libraries and modules used.

For the sklearn library however it is more common to import the class in the script file just before using it.

#### Step 2: Instantiate the Class

When using a class we need to instantiate the class (create an instance of the class):

    from sklearn.module import CustomClass

    cc=CustomClass(**kwargs)

To do this we call the class and assign it to an instance name. The instance name is usually lower case. The keyword input arguments are used to tune the class parameters. If they are not specified, the default values are used.

#### Step 3: Use the Appropriate Methods from the Instance of the Class

Recall that we can view a list of methods and attributes available from an object by typing in the object's name followed by a dot . and then tab ↹.

Recall that a method is a function that belongs to an object (for example an instance of a class), and the two terms are often used interchangeably. Functions need to be called with parentheses which enclose positional input arguments (mandatory) and keyword input arguments (optional, taking a default value if not specified).

#### Estimator

Estimator classes are the predominant classes of the sklearn library. These are used to fit data (usually in the form X,y) to an estimator and then use the estimator to predict the target of new data. An estimator has the methods fit and predict, and many classifiers also have the method predict_proba.

    from sklearn.module import Estimator

    est=Estimator()
    est.fit(X,y)
    newy=est.predict(newX)

For an estimator we use the method fit to fit the estimator with data. This method updates the instance in place; it returns the instance itself, so its return value is normally not assigned to anything.

Once our estimator is fitted with data we can use the method predict to predict a value from a new matrix of features. This method returns the predicted target determined by the estimator for each observation. Alternatively we can use the method predict_proba to predict the probability of each observation belonging to each discrete target type.
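The fit, predict and predict_proba pattern can be sketched on a tiny made-up dataset. This example borrows the KNeighborsClassifier class, which is only introduced properly later on, and the data values and the choice of n_neighbors=3 are assumptions picked purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: a single feature and two well separated classes.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])

est = KNeighborsClassifier(n_neighbors=3)
est.fit(X, y)                      # updates est in place

newX = np.array([[0.05], [1.15]])  # two new observations
newy = est.predict(newX)           # one predicted class per row of newX
proba = est.predict_proba(newX)    # one row per observation, one column per class

print(newy)
print(proba)
```

Each row of proba sums to 1, and predict simply reports the class with the highest probability in that row.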

#### Transformer

The sklearn library also has a number of transformer classes, which are usually used to transform an individual feature or target when preprocessing data. A transformer has the methods fit and transform.

    from sklearn.module import Transformer

    tran=Transformer()
    tran.fit(y)
    newy=tran.transform(y)

Note that the method fit once again acts in place, so its return value does not need to be assigned.

Because fit and transform normally use the same data as an input argument, they are often combined into a single method fit_transform.

    from sklearn.module import Transformer

    tran=Transformer()
    newy=tran.fit_transform(y)

#### Predictor

Predictor classes are similar to estimators except they also have the method predict_proba, which predicts the probability of an observation belonging to each class in a classification problem.

    from sklearn.module import Predictor

    predictor=Predictor()
    predictor.fit(X,y)
    prediction=predictor.predict_proba(data)

#### Model

The model class normally has the methods fit and score. A model can be thought of as containing an array of estimators, each with different parameters and scores. Models are usually used to evaluate the best parameters for an estimator.

    from sklearn.module import Model

    model=Model()
    model.fit(X,y)
    score=model.score(X,y)
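The text does not name a concrete model class. One class in sklearn that appears to match this description is GridSearchCV from the model_selection module, so the sketch below uses it as an assumption. The parameter grid and the use of the iris data (introduced in the next section) are likewise illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris['data'], iris['target']

# Assumed example: try several values of n_neighbors with 5-fold cross-validation.
model = GridSearchCV(KNeighborsClassifier(),
                     param_grid={'n_neighbors': [1, 3, 5, 7]},
                     cv=5)
model.fit(X, y)

print(model.best_params_)   # the parameter value with the best cross-validated score
score = model.score(X, y)   # accuracy of the best estimator on the supplied data
print(score)
```

Scoring on the same data used for fitting, as done here for brevity, gives an optimistic number; a held-out test set is the more honest check.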

## The Iris Example Bunch (4 Features and Target Classification)

Example data is contained within the datasets module of the sklearn library. We need to import the appropriate function, for example load_iris, load_digits, load_breast_cancer, load_boston, load_diabetes or load_wine. Note that load_boston was removed in scikit-learn 1.2, so it is only available in older versions.

To get the data we need to call the function and assign the output to the variable name iris.

    from sklearn.datasets import load_iris

    iris=load_iris()

The object type is a bunch which is the sklearn special object type for storing datasets as a series of attributes.

We can double click this bunch object to open it up within the variable explorer.

Let's first look at DESCR. Here we see that we have 4 features, 150 observations and 3 discrete categories.

Let's look at the data. It has 4 columns and each column is known as a feature.

The description for these features is in feature_names.

The target data is numerical. In this case there are three categories 0, 1 and 2.

Because the data consists of features:target pairs it can be used for supervised machine learning.

These numerical categories correspond to the 3 types of iris, whose names are within target_names.

Because the target is categorical this is known as a classification problem.
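If the variable explorer is not available, the same attributes can be inspected from code. The following is a minimal sketch; np.bincount is an extra numpy call, not used in the text, added here to count the observations per category.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)            # (observations, features)
print(iris.feature_names)         # description of each feature column
print(iris.target_names)          # names of the three iris types
print(np.bincount(iris.target))   # number of observations in each category
```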

### Converting the Bunch into a DataFrame and Saving to an Excel File

Although the dataset is already separated out for us, ready for use with sklearn, it is perhaps useful to recombine it into a dataframe that we can save to an Excel file.

To do this we can create a new dataframe supplying iris.data as a positional input argument and iris.feature_names to the keyword input argument columns. We can then create a new column 'target' and assign the iris.target data to it. We can change this to a category type. Then we calculate the size of target_names and assign it to ntargets. We can then create an empty dictionary catdict and loop over the range of ntargets, updating the dictionary with a new key:value pair where the key is the loop variable and the value is obtained by indexing into target_names by the loop variable. We can use catdict as the input argument to the method rename_categories, called from the cat attribute, to rename the categories.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris

    iris=load_iris()
    irisdf=pd.DataFrame(iris.data, columns=iris.feature_names)
    irisdf['target']=iris.target
    irisdf['target']=irisdf['target'].astype('category')

    # map each numeric category to its name
    ntargets=np.size(iris.target_names)
    catdict={}
    for i in range(ntargets):
        catdict[i]=iris.target_names[i]

    irisdf['target']=irisdf['target'].cat.rename_categories(catdict)
    irisdf.to_excel('iris.xlsx')

Note that in this case the feature names contain spaces and parentheses, so they do not follow the rules for naming variables. This means we cannot type each column name after a dot . to access it from the dataframe irisdf. The columns can only be accessed by indexing into irisdf using square bracket notation.

This gives the Excel File.

### Importing the Excel File as a DataFrame

We will import the Excel file as a dataframe and then convert it into the numerical data matrix X and target vector y.

Before doing this let's have a look at what we have. In yellow we have the index which denotes each row, commonly referred to as an observation (sample or record). In red we have the names of four columns known as features. We want to create a numerical data matrix X from the data highlighted in blue. In green we have the target. The target is categorical, corresponding to the names of three different types of iris, and we need to convert this into a numeric target vector y.

First of all we will read the Excel file using the read_excel function from the pandas pd library. The input argument is the Excel file name as a string (because the script is in the same folder we do not need to specify the full path). We will assign index_col to 0 as opposed to creating a duplicate index column. We can index into irisdf to select the column 'target' and then use the method astype to change it to the type 'category'.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris

    irisdf=pd.read_excel('iris.xlsx',index_col=0)
    irisdf['target']=irisdf['target'].astype('category')

### Data Visualization

The next thing we will want to do is visualize the data. We can do this using the pairplot function from the seaborn library; we will also need to load the matplotlib library.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import seaborn as sns

    irisdf=pd.read_excel('iris.xlsx',index_col=0)

    # create pairplot to visualise the data
    fig1=sns.pairplot(irisdf,hue='target')

The pairplot creates a histogram plot along the diagonal for each single feature and each other plot is a scatter plot which utilizes 2 features. We don't need to get bogged down in the specific details of the features used in each plot. What we are most interested in is how well separated out the three classifications are from one another and whether we can use the plot to identify an unknown datapoint.

A classification problem can be thought of visually as the addition of an unknown point to the plot. Let's select two of the visual data representations from the above, remove the axes and place an unknown point, the black dot, on each. The classification problem is essentially asking you to identify whether the black dot is more likely to belong to the blue classification 0, the orange classification 1 or the green classification 2. We can do this visually without any mathematics.

The data representation on the left hand side separates out the blue classification 0 from the rest however it is very hard to distinguish the boundary between the orange classification 1 and the green classification 2. The plot on the right hand side does a better job at setting a clearer boundary between the orange classification 1 and green classification 2 although there is a slight overlap. However visually we would be more likely to classify the black dot as orange classification 1.

Of course we can also visualize and interact with plots in 3D. To demonstrate we are just going to select three features. The plot on the right hand side had petal length on the x-axis and petal width on the y axis. For demonstration we can just add sepal length to the z axis.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import seaborn as sns

    irisdf=pd.read_excel('iris.xlsx',index_col=0)
    irisdf['target']=irisdf['target'].astype('category')

    fig2=plt.figure()
    ax2=fig2.add_subplot(111, projection='3d')
    ax2.scatter(irisdf['petal length (cm)'][irisdf['target']=='setosa'],
                irisdf['petal width (cm)'][irisdf['target']=='setosa'],
                irisdf['sepal length (cm)'][irisdf['target']=='setosa'],
                color='b')
    ax2.scatter(irisdf['petal length (cm)'][irisdf['target']=='versicolor'],
                irisdf['petal width (cm)'][irisdf['target']=='versicolor'],
                irisdf['sepal length (cm)'][irisdf['target']=='versicolor'],
                color='r')
    ax2.scatter(irisdf['petal length (cm)'][irisdf['target']=='virginica'],
                irisdf['petal width (cm)'][irisdf['target']=='virginica'],
                irisdf['sepal length (cm)'][irisdf['target']=='virginica'],
                color='g')
    ax2.set_xlabel('petal length (cm)')
    ax2.set_ylabel('petal width (cm)')
    ax2.set_zlabel('sepal length (cm)')

We can rotate the plot to see the same information we saw in the 2D plot.

Alternatively we can rotate the plot in order to visualize the maximum contrast between the features.

Our data has four dimensions. As humans we cannot visually perceive four dimensions, however we can mathematically use the fourth dimension to classify an unknown datapoint.

### The Label Encoder Transformer Class

Let's look at our dataframe of data. We need a numerical matrix X of features and a numerical vector y of targets.

We want to transform our target categories into numeric values so we can use it for Machine Learning.

To do this we can use the LabelEncoder class from the preprocessing module. We will need to import the class using:

from sklearn.preprocessing import LabelEncoder

We can highlight the class name in the script and press [Ctrl] + [i] to inspect it to get details about the class in the help pane.

We will need to instantiate the LabelEncoder class and assign it to an object name in this case le. This class has no keyword input arguments.

    from sklearn.preprocessing import LabelEncoder

    le=LabelEncoder()

Then we can call the methods from the object name (instance of the LabelEncoder class).

We can use fit and then transform, with both having irisdf['target'] as an input argument. The method fit performs an in place update of the instance, so its return value does not need to be assigned, and it is required before additional methods such as transform are used. The method transform returns the transformed data, and as we know this is the target data we can assign it to y.

    from sklearn.preprocessing import LabelEncoder

    le=LabelEncoder()
    le.fit(irisdf['target'])
    y=le.transform(irisdf['target'])

This gives y with numeric labels as opposed to the strings which were found in the original irisdf['target'].

Since the methods fit and transform have the same input argument, we can simplify the above using the method fit_transform instead.

    from sklearn.preprocessing import LabelEncoder

    le=LabelEncoder()
    y=le.fit_transform(irisdf['target'])
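To see which number was assigned to which name, a fitted LabelEncoder keeps the sorted category names in its classes_ attribute, and the method inverse_transform maps numbers back to names. The sketch below uses a short hand-written array of labels as a stand-in for irisdf['target'], so that it runs without the Excel file.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in for irisdf['target']: a few string labels in arbitrary order.
labels = np.array(['virginica', 'setosa', 'versicolor', 'setosa'])

le = LabelEncoder()
y = le.fit_transform(labels)

print(le.classes_)               # category names, sorted alphabetically
print(y)                         # position of each label within classes_
print(le.inverse_transform(y))   # back to the original strings
```

Because the names are sorted alphabetically, setosa, versicolor and virginica map to 0, 1 and 2, which matches the numeric target stored in the bunch.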

We will also need to create the matrix X. Since the data is already numerical we can simply use the method drop from the irisdf dataframe and then the method to_numpy to create a numpy array as opposed to a dataframe. As we want to remove a column we specify axis=1; this is similar to indexing a matrix X[m,n], where index 0 is the row m and index 1 is the column n.

X=irisdf.drop('target',axis=1).to_numpy()

We have seen how to load data from an excel file as a dataframe and preprocess X and y from the dataframe. These steps are required when dealing with your own data.

X and y are however directly available within the bunch and can be accessed using the keys 'data' and 'target' respectively. We will use these as a starting point from now on.

    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

### The Standard Scaler Transformer Class

All the features have a different numeric range. It is quite common to transform the features to a standard range when using machine learning.

The StandardScaler transformer class will transform your data so each feature has a mean value of 0 and a standard deviation of 1. To use the StandardScaler transformer class we must first import it from the preprocessing module.

Then we need to instantiate the class. We can type the class name with open parenthesis to get details about its positional and keyword input arguments (we will leave these all at default).

We can then call the fit_transform method:

    from sklearn.preprocessing import StandardScaler

    standardscalar=StandardScaler()
    X2=standardscalar.fit_transform(X)

Note the similar form to the LabelEncoder transformer class as both are transformer classes.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.preprocessing import StandardScaler

    standardscalar=StandardScaler()
    X2=standardscalar.fit_transform(X)

    # check the mean and standard deviation of each scaled feature
    np.mean(X2,axis=0)
    np.std(X2,axis=0)
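As a check on what the transformer does, the scaled values can be compared with a manual calculation: subtract each column's mean and divide by each column's standard deviation. This is a sketch of that comparison; the names scaler and manual are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris()['data']

scaler = StandardScaler()
X2 = scaler.fit_transform(X)

# Manual standardization, column by column (np.std uses the population formula by default).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X2, manual))            # scaled data matches the manual calculation
print(np.allclose(X2.mean(axis=0), 0.0))  # every feature now has mean 0
print(np.allclose(X2.std(axis=0), 1.0))   # every feature now has standard deviation 1
```

Floating point rounding means the means come out as very small numbers rather than exactly 0, which is why np.allclose is used instead of ==.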

### The Principal Component Analysis Transformer Class

The iris data set has 4 features which we cannot visualize in a 3D plot. Earlier we selected a subset of the features so we could make a 3D plot. In doing so we essentially threw information away from the feature that we didn't use. In our case we could more or less identify each classification within our 3D plot however in other cases the information contained within the additional feature may be crucial when it comes to separating out the species.

Instead of just throwing a feature away we can use a decomposition transformer class to mathematically perform dimensionality reduction and also center the new dimensions so they have a mean value of 0. Dimensionality reduction essentially uses mathematics to form combinations of the original features. These mathematical combinations may not make intuitive sense to us, however they can serve as a new set of axes which we can use to plot and visualize our data.

Many dimensionality reduction transformer classes are found within the decomposition module. In this case we will use the transformer class PCA. We need to first import the class from the module:

from sklearn.decomposition import PCA

Then we need to instantiate the class. We can type the class name with open parenthesis to get details about its positional and keyword input arguments.

We will set the number of components n_components to 3 (so we have 3 new features) and leave all the other keyword input arguments as defaults.

    from sklearn.decomposition import PCA

    pca=PCA(n_components=3)

Now we can look at the methods available to the transformer class by typing the instance name followed by a dot . and tab ↹.

We can call the method fit_transform to transform the feature data X from 4 dimensions to 3 dimensions. Note the similarity in syntax to the LabelEncoder transformer.

    from sklearn.decomposition import PCA

    pca=PCA(n_components=3)
    X2=pca.fit_transform(X)

The original features (columns) 0, 1, 2 and 3 corresponded to sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).

With the PCA transformation we instead get features 0, 1 and 2. We do not know what the physical meaning of these features is, however we can use them within a 3D plot.
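Although the new features have no direct physical meaning, the fitted PCA instance does record how they were built. The sketch below inspects two attributes that the text does not otherwise use: components_, which holds the weights applied to the original features, and explained_variance_ratio_, which gives the share of the total variance carried by each new feature.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris()['data']

pca = PCA(n_components=3)
X2 = pca.fit_transform(X)

print(X2.shape)                       # 150 observations, 3 new features
print(pca.components_.shape)          # one row of 4 weights per new feature
print(pca.explained_variance_ratio_)  # fraction of variance kept by each new feature
print(X2.mean(axis=0))                # the new features are centered on 0
```

The ratios are sorted from largest to smallest, so the first new feature always carries the most information about the spread of the data.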

We can plot this out in 3D.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.decomposition import PCA

    pca=PCA(n_components=3)
    X2=pca.fit_transform(X)

    fig3=plt.figure()
    ax3=fig3.add_subplot(111, projection='3d')
    ax3.scatter(X2[:,0][y==0], X2[:,1][y==0], X2[:,2][y==0], color='b')
    ax3.scatter(X2[:,0][y==1], X2[:,1][y==1], X2[:,2][y==1], color='r')
    ax3.scatter(X2[:,0][y==2], X2[:,1][y==2], X2[:,2][y==2], color='g')
    ax3.set_xlabel('feature 0')
    ax3.set_ylabel('feature 1')
    ax3.set_zlabel('feature 2')

Here we once again see that classification 0 (blue) is well separated from classification 1 and classification 2. Classification 1 and classification 2 are also separated but there is a region of overlap between the two classifications.

We can repeat the procedure but this time reduce the number of components down to 2 instead of 3.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.decomposition import PCA

    pca=PCA(n_components=2)
    X2=pca.fit_transform(X)

    fig4,ax4=plt.subplots()
    ax4.scatter(X2[:,0][y==0],X2[:,1][y==0],color='b')
    ax4.scatter(X2[:,0][y==1],X2[:,1][y==1],color='r')
    ax4.scatter(X2[:,0][y==2],X2[:,1][y==2],color='g')
    ax4.set_xlabel('feature 0')
    ax4.set_ylabel('feature 1')

We can repeat the procedure but this time reduce the number of components down to 1.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.decomposition import PCA

    pca=PCA(n_components=1)
    x2=pca.fit_transform(X)

    # plot the single feature along a horizontal line (50 observations per class)
    fig5,ax5=plt.subplots()
    ax5.scatter(x2[:,0][y==0],np.zeros(50),color='b')
    ax5.scatter(x2[:,0][y==1],np.zeros(50),color='r')
    ax5.scatter(x2[:,0][y==2],np.zeros(50),color='g')
    ax5.set_xlabel('feature 0')
    ax5.set_ylabel('')

With only one dimension we can see a strong overlap between the green and red datapoints.

### The Isomap Transformer Class

Another dimensionality reduction technique is the Isomap, which is found in the manifold module. Because it is a transformer class it is used in the same way as the general transformer form and as PCA:

    # general form
    from sklearn.module import TransformerClass
    transformer=TransformerClass()
    X2=transformer.fit_transform(X)

    # PCA
    from sklearn.decomposition import PCA
    pca=PCA(n_components=3)
    X2=pca.fit_transform(X)

    # Isomap
    from sklearn.manifold import Isomap
    iso=Isomap(n_components=3)
    X2=iso.fit_transform(X)

We can therefore use much of the same code as before to use Isomap to transform the data into 3D, 2D and 1D and make the respective plots:

    # Isomap with 3 components, plotted in 3D
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.manifold import Isomap

    iso=Isomap(n_components=3)
    X2=iso.fit_transform(X)

    fig3=plt.figure()
    ax3=fig3.add_subplot(111, projection='3d')
    ax3.scatter(X2[:,0][y==0], X2[:,1][y==0], X2[:,2][y==0], color='b')
    ax3.scatter(X2[:,0][y==1], X2[:,1][y==1], X2[:,2][y==1], color='r')
    ax3.scatter(X2[:,0][y==2], X2[:,1][y==2], X2[:,2][y==2], color='g')
    ax3.set_xlabel('feature 0')
    ax3.set_ylabel('feature 1')
    ax3.set_zlabel('feature 2')

    # Isomap with 2 components, plotted in 2D
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.manifold import Isomap

    iso=Isomap(n_components=2)
    X2=iso.fit_transform(X)

    fig4,ax4=plt.subplots()
    ax4.scatter(X2[:,0][y==0],X2[:,1][y==0],color='b')
    ax4.scatter(X2[:,0][y==1],X2[:,1][y==1],color='r')
    ax4.scatter(X2[:,0][y==2],X2[:,1][y==2],color='g')
    ax4.set_xlabel('feature 0')
    ax4.set_ylabel('feature 1')

    # Isomap with 1 component, plotted along a line
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.manifold import Isomap

    iso=Isomap(n_components=1)
    x2=iso.fit_transform(X)

    fig5,ax5=plt.subplots()
    ax5.scatter(x2[:,0][y==0],np.zeros(50),color='b')
    ax5.scatter(x2[:,0][y==1],np.zeros(50),color='r')
    ax5.scatter(x2[:,0][y==2],np.zeros(50),color='g')
    ax5.set_xlabel('feature 0')
    ax5.set_ylabel('')

### The K Nearest Neighbors Classifier Estimator Class

To visualize the concept behind K nearest neighbors, we are going to simplify the dataset by selecting every 10th point of the data set. We will rename the transformed data X3 and y3 respectively.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.decomposition import PCA

    pca=PCA(n_components=2)
    X2=pca.fit_transform(X)

    # keep every 10th observation
    X3=X2[::10]
    y3=y[::10]

Next we will plot the scatter plot.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris

    iris=load_iris()
    y=iris['target']
    X=iris['data']

    from sklearn.decomposition import PCA

    pca=PCA(n_components=2)
    X2=pca.fit_transform(X)

    X3=X2[::10]
    y3=y[::10]

    fig5,ax5=plt.subplots()
    ax5.scatter(X3[:,0][y3==0],X3[:,1][y3==0],color='b')
    ax5.scatter(X3[:,0][y3==1],X3[:,1][y3==1],color='r')
    ax5.scatter(X3[:,0][y3==2],X3[:,1][y3==2],color='g')
    ax5.set_xlabel('feature 0')
    ax5.set_ylabel('feature 1')

To this plot we will take two points (33 and 133) from the original dataset X2 that are not within the smaller dataset X3 and add them to the plot colored in black and magenta respectively.

    ax5.scatter(X2[33,0],X2[33,1],color='k')
    ax5.scatter(X2[133,0],X2[133,1],color='m')

Since we have taken these points from the original dataset we know that they correspond to classification 0 and classification 2 respectively, i.e. the black dot belongs with the blue dots and the magenta dot belongs with the green dots.

Looking at the scatter plot let's zoom into the black dot and then look at the closest datapoint to it. The closest datapoint to it is blue so we classify it also as blue. However it could just be beside a single outlier datapoint so we can instead look to the two closest datapoints and both are blue so we classify it as blue. If we go up to the 5 nearest data points they are all blue so we classify it as a blue datapoint.

Now let's zoom into the magenta point, get a ruler out and draw lines to the nearby datapoints. We can then measure the length of these lines.

- If we consider only the single nearest neighbor we would take the red point with a distance of 7.56 and incorrectly classify the magenta point as a red datapoint.
- If we consider only the 2 nearest datapoints we would take the red point with a distance of 7.56 and the green datapoint with a distance of 14.84. Since we have equal quantities of the two different classifications we would then consider the distances, and as the red datapoint is closer we would again incorrectly classify the magenta point as a red datapoint.
- If we consider the 5 nearest datapoints we would take 3 green datapoints and 2 red datapoints and classify the magenta point as a green datapoint.
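The ruler procedure above can be sketched in a few lines of numpy. This is only an illustration of the idea, not how sklearn implements it: the function knn_predict, the coordinates and the tie-break rule (prefer the class of the closest neighbor, as in the second bullet) are all assumptions made for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, point, k):
    """Classify point by a majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - point, axis=1)  # straight-line distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    labels = y_train[nearest]                            # their classes, closest first
    counts = np.bincount(labels)                         # votes per class
    winners = np.flatnonzero(counts == counts.max())     # classes with the most votes
    # On a tie, fall back to the class of the closest neighbor among the winners.
    for label in labels:
        if label in winners:
            return int(label)

# Hypothetical layout: 1 is red, 2 is green, and the unknown point sits at the origin.
X_train = np.array([[1.0, 0.0], [0.0, 2.0], [2.5, 0.0], [0.0, -3.0], [3.5, 0.0]])
y_train = np.array([1, 2, 2, 2, 1])
unknown = np.array([0.0, 0.0])

for k in (1, 2, 5):
    print(k, knn_predict(X_train, y_train, unknown, k))
```

With this layout the single nearest neighbor is red, two neighbors tie and fall back to red, and five neighbors give a green majority, mirroring the three bullets above.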

sklearn has the neighbors module which contains a number of estimators which use similar metrics to that depicted above. Let's use the KNeighborsClassifier.

We first import the estimator class.

from sklearn.neighbors import KNeighborsClassifier

Next we want to instantiate the class. Let's call it with open parenthesis to view the positional and keyword input arguments.

Among the keyword input arguments we see n_neighbors, which has a default value of 5. We won't supply this keyword input argument here, so the default will be used.

We can now type in the class followed by a dot . and tab ↹ to get a list of methods available.

And because we are using an estimator we are most interested in the methods fit and predict.

Note once again that the fit method updates the instance of the class in place, so its return value does not need to be assigned.

    from sklearn.neighbors import KNeighborsClassifier

    knn=KNeighborsClassifier()
    knn.fit(X3,y3)

We can use the method predict to predict unknown targets (dependent variables) from a matrix of known features (independent variables).

We will combine the black and magenta datapoint into a matrix X4. This matrix has the same number of features as the X3 data used to in the method fit but has a differing number of observations, The method fit does have a return value.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) X3=X2[::10] y3=y[::10] X4=np.array([X2[33,:],X2[133,:]]) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() knn.fit(X3,y3) y4_pred=knn.predict(X4)

y_pred returns the vector [0,2] which means the black data point is predicted to be blue (recall that blue is used to represent classification 0) and the magenta datapoint is predicted to be green (recall that green is used to represent classification 2).

Now if we instead of using the default, we can specify the keyword input argument n_neighbors=1.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) X3=X2[::10] y3=y[::10] X4=np.array([X2[33,:],X2[133,:]]) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1) knn.fit(X3,y3) y4_pred=knn.predict(X4)

In this case y4_pred returns the vector [0,1] which means the black datapoint is predicted to be blue (recall that blue is used to represent classification 0) and the magenta datapoint is incorrectly predicted to be red (recall that red is used to represent classification 1).

We have 2 different estimators which yield different predictions for the magenta datapoint. In our case we know what the magenta datapoint should be, but in a real life application we won't know the actual classification. The question we should ask is which estimator we should use, and so we need some sort of evaluation procedure.
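The comparison above can be sketched as a short script that fits one estimator per value of n_neighbors and prints both predictions. This is a minimal sketch that rebuilds X3, y3 and X4 as in the text; the dictionary name predictions is an assumption introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the reduced dataset used in the text
iris = load_iris()
X, y = iris['data'], iris['target']
X2 = PCA(n_components=2).fit_transform(X)
X3, y3 = X2[::10], y[::10]      # every 10th observation
X4 = X2[[33, 133], :]           # the black and magenta datapoints

# Fit one estimator per value of n_neighbors and compare the predictions
predictions = {}
for k in (5, 1):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X3, y3)
    predictions[k] = knn.predict(X4)
    print(f'n_neighbors={k}: y4_pred={predictions[k]}')
```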

#### Train-Test Split Function

In order to determine if our estimator classifies a datapoint correctly or not we need data where we know the correct classification. So instead of using all the data to fit the estimator, we need to reserve some of the data to test the predictions of the estimator. We can split our data into a training dataset (for example 80 %) and a testing dataset (for example 20 %). We can then train our estimator on the training dataset and test the estimator with the testing dataset to calculate how many times the estimator predicted the correct classification.

To do this we can use the function train_test_split which we must first import from sklearn.model_selection. Note that this is a function and not a class. The function name is all lower case with the _ used in place of spaces.

from sklearn.model_selection import train_test_split

We can get some details about the input arguments by typing the function with open parenthesis.

To get the full details we can highlight the function within the script by pressing [Ctrl] + [i] to inspect the function, which opens the documentation in the Help pane.

In our case we want to split X3 and y3 into training and testing data so these will be the positional input arguments.

from sklearn.model_selection import train_test_split splitdata=train_test_split(X3,y3, test_size=0.25,train_size=0.75, random_state=1)

We will also use the keyword input arguments test_size=0.25 and train_size=0.75 to use 75 % of the data for training and 25 % of the data for testing. This function will randomize the order of the observations in X3 and y3 before splitting them into testing and training sets, and the keyword input argument random_state can be used to assign the integer value of the random seed for the sake of reproducibility. We will set this to 1.

We can see that splitdata is a list with 4 elements.

It is common to unpack this. This can be done using square brackets on the left hand side, i.e. in the form:

[X3_train,X3_test,y3_train,y3_test]=splitdata

However, it also works without them:

X3_train,X3_test,y3_train,y3_test=splitdata

This is normally done in a single line when calling the function.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) X3=X2[::10] y3=y[::10] from sklearn.model_selection import train_test_split X3_train,X3_test,y3_train,y3_test=train_test_split(X3,y3, test_size=0.25,train_size=0.75, random_state=1)
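To see what the split does to the shapes, here is a minimal sketch on a hypothetical stand-in dataset (X_demo and y_demo are made up for illustration and have the same number of observations as X3 and y3). It also checks that each row of features stays paired with its target after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 15 observations, 2 features, 3 classes.
# Row i holds the values [2*i, 2*i+1], so the row can be traced back to its target.
X_demo = np.arange(30).reshape(15, 2)
y_demo = np.repeat([0, 1, 2], 5)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1)

print(X_train.shape, X_test.shape)   # 11 training rows and 4 testing rows

# Recover the original row index from the first feature and look up its target
original_rows = X_train[:, 0] // 2
print(np.array_equal(y_demo[original_rows], y_train))
```

With 15 observations a 25 % test share is rounded up to 4 observations, which is why the text later works with 4 test observations.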

The test and training data looks like the following.

And to visualize this we can plot it with the test points colored black.

fig6,ax6=plt.subplots() ax6.scatter(X3_train[:,0][y3_train==0],X3_train[:,1][y3_train==0],color='b') ax6.scatter(X3_train[:,0][y3_train==1],X3_train[:,1][y3_train==1],color='r') ax6.scatter(X3_train[:,0][y3_train==2],X3_train[:,1][y3_train==2],color='g') ax6.scatter(X3_test[:,0],X3_test[:,1],color='k') ax6.set_xlabel('feature 0') ax6.set_ylabel('feature 1')

#### Evaluation Metrics

Now that we have the training and testing data, we are going to use the training data for the fit and the testing data for predictions. This will give us y3_pred which we can compare to the known classifications y3_test:

from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1) knn.fit(X3_train,y3_train) y3_pred=knn.predict(X3_test)

We can run this code.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) X3=X2[::10] y3=y[::10] from sklearn.model_selection import train_test_split X3_train,X3_test,y3_train,y3_test=train_test_split(X3,y3, test_size=0.25,train_size=0.75, random_state=1) fig6,ax6=plt.subplots() ax6.scatter(X3_train[:,0][y3_train==0],X3_train[:,1][y3_train==0],color='b') ax6.scatter(X3_train[:,0][y3_train==1],X3_train[:,1][y3_train==1],color='r') ax6.scatter(X3_train[:,0][y3_train==2],X3_train[:,1][y3_train==2],color='g') ax6.scatter(X3_test[:,0],X3_test[:,1],color='k') ax6.set_xlabel('feature 0') ax6.set_ylabel('feature 1') from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1) knn.fit(X3_train,y3_train) y3_pred=knn.predict(X3_test)

And because our test dataset is very small we can easily visualize the difference between these by viewing both within the variable explorer.

We can see that the test observation at index 1 is incorrectly classified as classification 2 as opposed to classification 1; however, everything else is correct. This becomes harder to quantify for hundreds to thousands of test observations so we may want to quickly compute a metric.

#### Accuracy Score Function

sklearn has a module metrics which contains a number of functions to calculate metrics.

One of the metrics we can compute is the accuracy score which calculates the ratio of correct predictions with respect to the number of test observations.

from sklearn.metrics import accuracy_score acc_score=accuracy_score(y3_test,y3_pred)

In this case the accuracy score is 0.75 as we had 3 correct predictions out of 4 total test observations.
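As a sanity check, the accuracy score can be reproduced by hand as the fraction of matching entries. The two vectors below are hypothetical and chosen only to mimic the situation described in the text (one wrong prediction out of four); they are not taken from the actual split.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical known and predicted classifications (one error at index 1)
y_true = np.array([0, 1, 2, 0])
y_pred = np.array([0, 2, 2, 0])

manual_acc = np.mean(y_true == y_pred)        # fraction of matching entries
sklearn_acc = accuracy_score(y_true, y_pred)
print(manual_acc, sklearn_acc)                # both 0.75
```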

#### Confusion Matrix Function

While the ratio of correct predictions is a useful metric of how accurate the estimator is, it does not tell you where the estimator has gone wrong. We may instead want to use the function confusion_matrix to compute a confusion matrix.

from sklearn.metrics import confusion_matrix con_mat=confusion_matrix(y3_test,y3_pred)

The rows of the confusion matrix correspond to the known classifications (0,1 and 2) and the columns correspond to the predicted classifications (0,1,2). The pair 0,0 occurs 2 times which is why the value at 0,0 is 2 in the confusion matrix. Correct classifications lie on the main diagonal and incorrect classifications lie elsewhere. In this case a test datapoint with classification 1 was incorrectly classified as classification 2.

This makes intuitive sense and tells us that there is some confusion between classification 1 and classification 2 but classification 0 is not generally confused with the other classifications (row 0 and col 0 of the confusion matrix are all zero values with the exception of the diagonal).
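The layout of the confusion matrix can be illustrated by building it by hand: each (known, predicted) pair adds 1 to the cell at that row and column. The vectors are hypothetical and only mimic the case described above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical known and predicted classifications (one error at index 1)
y_true = np.array([0, 1, 2, 0])
y_pred = np.array([0, 2, 2, 0])

# Rows are known classifications, columns are predicted classifications
n_classes = 3
manual_cm = np.zeros((n_classes, n_classes), dtype=int)
for known, predicted in zip(y_true, y_pred):
    manual_cm[known, predicted] += 1

sklearn_cm = confusion_matrix(y_true, y_pred)
print(sklearn_cm)

# The accuracy score is the sum of the main diagonal over the total count
print(np.trace(sklearn_cm) / np.sum(sklearn_cm))   # 0.75
```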

#### Variance of Accuracy Score

There is a slight problem when computing the accuracy score and confusion matrix when used in the manner above especially for a dataset with a low number of points. This problem is due to the randomization used when performing the test train split. Let us demonstrate by changing the random seed from 1 to 4.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) X3=X2[::10] y3=y[::10] from sklearn.model_selection import train_test_split X3_train,X3_test,y3_train,y3_test=train_test_split(X3,y3, test_size=0.25,train_size=0.75, random_state=4) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1) knn.fit(X3_train,y3_train) y3_pred=knn.predict(X3_test) from sklearn.metrics import accuracy_score acc_score=accuracy_score(y3_test,y3_pred) from sklearn.metrics import confusion_matrix con_mat=confusion_matrix(y3_test,y3_pred)

This gives the following and we can see that none of the randomly selected test datapoints are at the boundaries, which makes them easier to classify.

As a result the accuracy score is 1 and all the non-zero values lie on the main diagonal, suggesting the estimator is perfect.

In reality the estimator is not perfect and the accuracy score has a variance. We can have a look at this by looping over integer values of i and changing the random seed. We can append each acc_score to a list acc_scores and perform a plot of this.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) X3=X2[::10] y3=y[::10] acc_scores=[] for i in range(100): from sklearn.model_selection import train_test_split X3_train,X3_test,y3_train,y3_test=train_test_split(X3,y3, test_size=0.25,train_size=0.75, random_state=i) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1) knn.fit(X3_train,y3_train) y3_pred=knn.predict(X3_test) from sklearn.metrics import accuracy_score acc_score=accuracy_score(y3_test,y3_pred) acc_scores.append(acc_score) fig7,ax7=plt.subplots() ax7.scatter(range(100),acc_scores) acc_score_mean=np.mean(acc_scores) acc_score_std=np.std(acc_scores)

Here we see that the accuracy score of the estimator varies between 0.5 (right half of the time) and 1.0 (right all of the time) depending on the random seed.

The mean acc_score_mean is 0.828 and the standard deviation acc_score_std is 0.150.
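The loop above can also be written with the imports at the top and the per-seed work wrapped in a function. This is a sketch of the same procedure; the function name split_accuracy is an assumption introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X2 = PCA(n_components=2).fit_transform(iris['data'])
X3, y3 = X2[::10], iris['target'][::10]

def split_accuracy(seed):
    """Accuracy of a 1-nearest-neighbor estimator for one random split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X3, y3, test_size=0.25, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    return accuracy_score(y_test, knn.predict(X_test))

acc_scores = np.array([split_accuracy(seed) for seed in range(100)])
print(acc_scores.mean(), acc_scores.std())
```

Because each split has only 4 test observations, every individual score is a multiple of 0.25, which explains the coarse steps in the scatter plot.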

#### K-Fold Cross Validation Function

The mean and standard deviation calculation is in essence a cross-validation. sklearn has a number of cross-validation classes and functions within the model_selection module. A model can be thought of as an array of estimators with varying parameters. Cross-validation gives each estimator within the model a score and during the model selection process we typically select the best estimator within the model based on the score.

When we used the train-test split we reserved 25 % of our data for testing, which meant we had fewer datapoints to fit the model with.

The diagram demonstrates 10 Fold Cross Validation.

We take the first 10 % of observations as testing data (the first fold) and use the remaining data to train on. Then we take the second 10 % of observations as testing data (the second fold) and use the remaining data to train on. We continue this procedure until we take the kth 10 % of observations as testing data (the kth fold) and use the remaining data to train on.

The data is used more efficiently with a K-Fold Cross Validation (each observation is used for training and testing).
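The folding procedure can be inspected directly with the KFold class from sklearn.model_selection. The sketch below uses a placeholder feature matrix of 150 observations (the size of the iris dataset) and confirms that every observation is used for testing exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder feature matrix: only the number of observations matters here
n_obs = 150
X_placeholder = np.zeros((n_obs, 1))

kfold = KFold(n_splits=10)
test_counts = np.zeros(n_obs, dtype=int)
fold_sizes = []
for train_idx, test_idx in kfold.split(X_placeholder):
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1   # count how often each observation is tested

print(fold_sizes[0])             # (135, 15)
print(np.all(test_counts == 1))  # True
```

Note that when an integer number of folds is passed to scikit-learn's cross-validation helpers together with a classifier, a stratified variant (StratifiedKFold) is used instead, which keeps the class proportions similar in each fold. Plain KFold is used here only because it matches the diagram most directly.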

To perform the K-Fold cross-validation we use the non-split dataset. In our case we will use our X2 feature matrix and y target vector. This is the full dataset, not the subset made from every 10th observation, as we need a larger number of datapoints for this function. Note also that it is not split into training and testing data. We also need to import and instantiate our estimator.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1)

Next we can import the function cross_val_score from the sklearn.model_selection which is the same module that contains the related function train_test_split.

from sklearn.model_selection import cross_val_score

If we type in the function with open parenthesis we see details about the positional and keyword input arguments.

We can highlight the cross_val_score within the script file and press [Ctrl] + [ i ] to inspect it which will display the documentation in the Help Pane.

The data is X2 and y, i.e. the full dataset (folding will be automatically carried out by this function). The number of folds k is selected using the keyword input argument cv=10 (cross-validation) and in this case we want to return the accuracy score so we assign the keyword input argument scoring='accuracy'. The output is going to be a numpy array vector of the accuracy scores for each fold. We can use the numpy functions mean and std to get the mean and standard deviation respectively.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1) from sklearn.model_selection import cross_val_score scores=cross_val_score(knn,X2,y,cv=10,scoring='accuracy') score_mean=np.mean(scores) score_std=np.std(scores)

This gives the following numpy array, with score_mean=0.967 and score_std=0.044.

#### Grid Search Cross Validation Model Class

Previously we saw differences in classifications when using the estimator knn with n_neighbors=1 and n_neighbors=5. n_neighbors is known as a tuning parameter and instead of manually creating different instances of knn, we can create a model which contains a grid of estimators with differing values of n_neighbors alongside their cross-validation scores. Typically we will select the estimator which has the score closest to 1. To do this we can use the model class GridSearchCV which is also found in the model_selection module.

from sklearn.model_selection import GridSearchCV

In this case if we type in GridSearchCV with open parenthesis we see the positional input arguments.

The first positional input argument is estimator. For this we need to import the estimator class and then instantiate the class usually using only the default parameters. In our case:

from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier()

We also have the positional input argument param_grid. We typically set up param_grid as a dictionary. The keys of the dictionary are the keyword input arguments of the estimator class and each corresponding value is a list or numpy array of the settings to try (a single setting is given as a list with one element).

Let's have a look at KNeighborsClassifier. We see the keyword input arguments n_neighbors, weights, algorithm, leaf_size, p, metric, metric_params and n_jobs.

Recall dictionaries have the form:

custom_dict={'key1':'value1', 'key2':'value2', 'key3':'value3'}

We can create a param_grid which contains all of the keyword input arguments with their default values, each wrapped in a single-element list, except for n_neighbors which has a numeric array for its corresponding value. The model created will be a grid of estimators for each combination of these values.

param_grid={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform'], 'algorithm':['auto'], 'leaf_size':[30], 'p':[2], 'metric':['minkowski'], 'metric_params':[None], 'n_jobs':[None]}

We can omit all the keyword input arguments we don't want to change from their default values.

param_grid={'n_neighbors':np.arange(start=1,stop=31,step=1)}

Then we can provide the estimator (instantiated with default parameters) and the param_grid to the model class GridSearchCV as positional input arguments to make a model of multiple estimators. We also want to perform scoring using the keyword input argument scoring='accuracy' and in order to perform a 10 Fold Cross Validation we will also set the keyword input argument cv=10.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() param_grid={'n_neighbors':np.arange(start=1,stop=31,step=1)} from sklearn.model_selection import GridSearchCV modelknn=GridSearchCV(knn,param_grid, scoring='accuracy',cv=10)

Note that the estimator class was imported and then instantiated; however, no data was fitted to the estimator. Instead of using the method fit from the instance of the estimator class knn, we use the method fit from the instance of the model class modelknn and this fits the data to the grid of estimators contained within the model.

The model also has the method predict which will use the estimator within the model that is deemed the best (the one with the 10-fold cross-validated score closest to 1).

We use the method fit providing the X2 and y as the input arguments.

modelknn.fit(X2,y)

The model has a number of attributes. They do not show when using the dot . and tab ↹ within the script but do display when this is done in the console.

They are also mentioned in the documentation which we can inspect in the help pane by highlighting the class name and using the keyboard combination [Ctrl] + [ i ].

Let's assign some of these attributes to variables:

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() param_grid={'n_neighbors':np.arange(start=1,stop=31,step=1)} from sklearn.model_selection import GridSearchCV modelknn=GridSearchCV(knn,param_grid, scoring='accuracy',cv=10) modelknn.fit(X2,y) cv_results=modelknn.cv_results_ best_score=modelknn.best_score_ best_params=modelknn.best_params_ best_estimator=modelknn.best_estimator_

We see that cv_results is a dictionary. We can open it up within the variable explorer.

For example, we may want to plot a scatter plot of the mean_test_score with respect to k.

fig8,ax8=plt.subplots() ax8.scatter(np.arange(start=1,stop=31,step=1), cv_results['mean_test_score']) ax8.set_xlabel('k') ax8.set_ylabel('mean score')

We can see that the best_score (closest to 1) is 0.966 and best_params is a dictionary of the best parameters, in this case n_neighbors=3, which corresponds to the first occurrence of the highest score on the chart. The best_estimator is an instance of the estimator class with these parameters; however, we can call methods such as predict from modelknn directly as opposed to using this attribute to create a separate instance.
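Instead of reading cv_results in the variable explorer, the top-ranked entries can be listed in code. This is a minimal sketch of the same grid search; it relies on the 'params', 'mean_test_score' and 'rank_test_score' keys of cv_results_, and the variable name order is an assumption introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()
X2 = PCA(n_components=2).fit_transform(iris['data'])
y = iris['target']

param_grid = {'n_neighbors': np.arange(1, 31)}
modelknn = GridSearchCV(KNeighborsClassifier(), param_grid,
                        scoring='accuracy', cv=10)
modelknn.fit(X2, y)

# List the five best-ranked parameter settings with their mean scores
cv_results = modelknn.cv_results_
order = np.argsort(cv_results['rank_test_score'])[:5]
for i in order:
    print(cv_results['params'][i], round(cv_results['mean_test_score'][i], 3))

print(modelknn.best_params_, modelknn.best_score_)
```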

We can add 'weights':['uniform','distance'] to the param_grid dictionary to look at the effect of weighting uniformly (default) or weighting by distance.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() param_grid={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform','distance']} from sklearn.model_selection import GridSearchCV modelknn=GridSearchCV(knn,param_grid, scoring='accuracy',cv=10) modelknn.fit(X2,y) cv_results=modelknn.cv_results_ best_score=modelknn.best_score_ best_params=modelknn.best_params_ best_estimator=modelknn.best_estimator_

In this case the best option is n_neighbors=8 and weights='distance' which gives a higher best_score of 0.973.

Note that cv_results now has 60 entries (30×2) as it has examined the model over both the parameters in param_grid.
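The 60 combinations can be enumerated without fitting anything by using the ParameterGrid class from sklearn.model_selection. This is a small sketch; the variable name combinations is an assumption introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {'n_neighbors': np.arange(1, 31),
              'weights': ['uniform', 'distance']}

# Every combination of the values in param_grid, as a list of dictionaries
combinations = list(ParameterGrid(param_grid))
print(len(combinations))   # 60
print(combinations[:2])
```

Each additional parameter multiplies the number of combinations, and each combination is fitted once per fold, so the cost of a grid search grows quickly.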

Recall the fact that we performed a PCA transformation of the features for the sake of visualization of the data. Although we cannot visualize the data in 4D, the estimators can understand 4D and higher dimensional data.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() param_grid={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform','distance']} from sklearn.model_selection import GridSearchCV modelknn=GridSearchCV(knn,param_grid, scoring='accuracy',cv=10) modelknn.fit(X,y) cv_results=modelknn.cv_results_ best_score=modelknn.best_score_ best_params=modelknn.best_params_ best_estimator=modelknn.best_estimator_

In this case the best_params are n_neighbors=13 and weights='uniform' which gives a higher best_score of 0.980. The score is likely higher because some information was lost when the PCA transformation was used.

We can then create a matrix of new features. We have used all the datapoints in our dataset, so let's just take two existing sets of features and modify them slightly.

X5=X[[33,133],:]+0.1

Because we are basing these on existing datapoints we expect them to be near datapoint 33 and datapoint 133 which are classified as classification 0 and classification 2 respectively.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] X5=X[[33,133],:]+0.1 from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() param_grid={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform','distance']} from sklearn.model_selection import GridSearchCV modelknn=GridSearchCV(knn,param_grid, scoring='accuracy',cv=10) modelknn.fit(X,y) y5_pred=modelknn.predict(X5) cv_results=modelknn.cv_results_ best_score=modelknn.best_score_ best_params=modelknn.best_params_ best_estimator=modelknn.best_estimator_

Unsurprisingly y5_pred gives classification 0 for the first datapoint and classification 2 for the second datapoint as expected.

#### Randomized Search Cross Validation Model Class

In the example above we only used two parameters and we went through every single combination of these two parameters. This can be seen by opening params within cv_results. If we add more and more parameters this will become very computationally expensive.

A related model is RandomizedSearchCV which only evaluates a fixed number of randomly sampled parameter combinations, as opposed to stepping through every single combination. The number of combinations sampled can be set by the keyword input argument n_iter=10 and because we are using random numbers we can also set the random seed using random_state for reproducibility. The param_distributions is the same as the param_grid but the terminology reflects that we are going to sample from it as a distribution and not step across every individual point in the grid.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] X5=X[[33,133],:]+0.1 from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() param_distributions={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform','distance']} from sklearn.model_selection import RandomizedSearchCV modelknn=RandomizedSearchCV(knn,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1) modelknn.fit(X,y) y5_pred=modelknn.predict(X5) cv_results=modelknn.cv_results_ best_score=modelknn.best_score_ best_params=modelknn.best_params_ best_estimator=modelknn.best_estimator_

In this case the best_params are n_neighbors=17 and weights='uniform' which gives an equivalent best_score of 0.980. Unsurprisingly y5_pred once again gives classification 0 for the first datapoint and classification 2 for the second datapoint as expected.

In this case it evaluated the following 10 combinations (set by the keyword input argument n_iter=10) as opposed to going through all 60 combinations in the grid, and the estimator selected gives equivalent or very similar accuracy.
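The sampled combinations can be listed from the 'params' entry of cv_results_. This is a minimal sketch of the same randomized search; the variable names search and tried are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV

iris = load_iris()
X, y = iris['data'], iris['target']

param_distributions = {'n_neighbors': np.arange(1, 31),
                       'weights': ['uniform', 'distance']}
search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions,
                            n_iter=10, scoring='accuracy', cv=10,
                            random_state=1)
search.fit(X, y)

# Only n_iter combinations are evaluated, not the full grid of 60
tried = search.cv_results_['params']
for params in tried:
    print(params)
print(search.best_params_, search.best_score_)
```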

### The Curse of Dimensionality

The Nearest Neighbors algorithm is often said to be sufficient for around one third of classification problems. We discussed earlier how having more features can often make a classification easier; however, there is also a drawback when using nearest neighbors with a higher number of features as "the nearest neighbor becomes highly separated in a multidimensional space".

Conceptually, let's use x to represent a datapoint. For convenience assume we have uniformly separated datapoints spanning an entire feature, so that any unknown value lies near a datapoint. For a single feature we can see that the number of datapoints corresponds to the length of the feature (1 dimension so to the power 1). However if we expand the number of features to 2 then we will need data that covers the span of both axes of features (2 dimensions so to the power 2). If we expand the number of features to 3 we need data that covers the span of 3 axes (3 dimensions so to the power 3) and so on (n dimensions so to the power n). This means that we need exponentially more data as we add additional features. Therefore we must consider other ways to create classification boundaries.
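The growth described above can be made concrete with a short calculation. The figure of 10 datapoints per feature is a hypothetical choice for illustration; the feature counts 4, 13, 30 and 64 are those of the example datasets in this guide.

```python
# Hypothetical density: 10 evenly spaced datapoints along each feature
points_per_feature = 10

# Number of datapoints needed to cover the feature space at that density
required = {n_features: points_per_feature ** n_features
            for n_features in (1, 2, 3, 4, 13, 30, 64)}

for n_features, n_points in required.items():
    print(f'{n_features} features: {n_points:.0e} datapoints')
```

Even at this coarse density, 4 features already call for 10,000 datapoints, far more than the 150 observations in the iris dataset.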

### The Naive Bayes Classifier Estimator Class

Bayes' theorem states:

$$P(y \mid X)\,P(X) = P(X \mid y)\,P(y)$$

That is, the probability of y given X multiplied by the probability of X is equal to the probability of X given y multiplied by the probability of y.

In other words the probability of going out given that it is sunny multiplied by the probability that it is sunny is equal to the probability that it is sunny given that you are out multiplied by the probability that you are out.

In our classification problem we are interested in y given the features X so we can divide through by the probability of X to get:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome.

Let's assume we have two features $x_0$ and $x_1$, so:

$$P(X \mid y) = P(x_0 \mid y)\,P(x_1 \mid y)$$

This gives us:

$$P(y \mid x_0, x_1) = \frac{P(x_0 \mid y)\,P(x_1 \mid y)\,P(y)}{P(x_0, x_1)}$$

With respect to the axes of our features our training data consists of a number of discrete datapoints. In order to make predictions from these we need to fit a continuous probability model to each feature, which we can use as an estimate of the terms on the right hand side.

We can look at the naive_bayes module which contains a number of estimator classes that make a numerical representation over the range of all the features. We will use the GaussianNB class which uses a Gaussian continuous numerical representation.

We will create an instance of this estimator class using the default keyword input arguments:

We will again perform 10 Fold Cross Validation:

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] X5=X[[33,133],:]+0.1 from sklearn.naive_bayes import GaussianNB gaussiannb=GaussianNB() gaussiannb.fit(X,y) y5_pred=gaussiannb.predict(X5) from sklearn.model_selection import cross_val_score scores=cross_val_score(gaussiannb,X,y,cv=10,scoring='accuracy') score_mean=np.mean(scores) score_std=np.std(scores)

The score_mean is 0.953 which in this case is inferior to the performance of the KNeighborsClassifier estimator. y5_pred once again gives classification 0 for the first datapoint and classification 2 for the second datapoint as expected.
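To illustrate what a Gaussian representation means in practice, the sketch below fits a Gaussian to each feature within each classification by hand and picks the classification with the largest posterior. It is a simplified illustration under stated assumptions, not scikit-learn's implementation (GaussianNB additionally applies a small variance smoothing), and the names priors, means, variances and log_posterior are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris['data'], iris['target']
classes = np.unique(y)

# P(y): the fraction of observations in each classification
priors = np.array([np.mean(y == c) for c in classes])
# Mean and variance of every feature within every classification
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_posterior(X_new):
    """log P(y) plus the sum over features of log P(x_i | y), per classification."""
    diff = X_new[:, np.newaxis, :] - means   # shape (n_obs, n_classes, n_features)
    log_like = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
    return np.log(priors) + log_like.sum(axis=2)

manual_pred = classes[np.argmax(log_posterior(X), axis=1)]
sklearn_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sklearn_pred)
print(agreement)
```

The denominator of Bayes' theorem is the same for every classification, so it can be dropped when only the most probable classification is needed. Working with logarithms turns the product of probabilities into a sum, which avoids numerical underflow.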

### The Decision Tree Classifier Estimator Class

Each feature in X and each target in y are numeric. A decision tree essentially asks a question that divides the dataset into two parts. Let's take a very basic dataset and ask a question about one of the two features. Let's start with feature 0.

Suppose we ask "is feature 0 less than 0?"

In total we have 5 red datapoints and 5 green datapoints.

This gives us

- 2 data points on the left hand side: 2 red and 0 green
- 8 data points on the right hand side: 3 red and 5 green

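The gini is defined by:

$$G = 1 - \sum_{i} p_i^2$$

where $p_i$ is the fraction of datapoints on that side which belong to classification $i$.

So the gini on the left hand side will be:

$$G_{left} = 1 - \left(\tfrac{2}{2}\right)^2 - \left(\tfrac{0}{2}\right)^2 = 0$$

And on the right hand side it will be:

$$G_{right} = 1 - \left(\tfrac{3}{8}\right)^2 - \left(\tfrac{5}{8}\right)^2 \approx 0.469$$

The gini resulting from this question is the sum of the gini on each side weighted by the ratio of points on each side:

$$G_{split} = \tfrac{2}{10} G_{left} + \tfrac{8}{10} G_{right} = 0.375$$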

We can use a for loop to check the gini at each possible split along feature 0. Now suppose we get to 1.5.

This gives us

- 5 data points on the left hand side: 5 red and 0 green
- 5 data points on the right hand side: 0 red and 5 green

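The gini is again defined by:

$$G = 1 - \sum_{i} p_i^2$$

So the gini on the left hand side will be:

$$G_{left} = 1 - \left(\tfrac{5}{5}\right)^2 - \left(\tfrac{0}{5}\right)^2 = 0$$

And on the right hand side it will be:

$$G_{right} = 1 - \left(\tfrac{0}{5}\right)^2 - \left(\tfrac{5}{5}\right)^2 = 0$$

The gini resulting from this question is the sum of the gini on each side weighted by the ratio of points on each side:

$$G_{split} = \tfrac{5}{10} G_{left} + \tfrac{5}{10} G_{right} = 0$$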

This means we have perfectly split our data. In general a comparison of a feature with a numeric value is made in a for loop until the gini is minimised. In the case above only one single question is required but in more complicated datasets a series of questions would get asked within a for loop. The answer with the minimized gini would be selected and then the next question asked.
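The two splits discussed above can be checked numerically. This is a minimal sketch that assumes the standard gini impurity (1 minus the sum of squared class fractions); the function names gini and split_gini are introduced here for illustration.

```python
def gini(counts):
    """Gini impurity of one side, given the count of each classification."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in counts)

def split_gini(left_counts, right_counts):
    """Gini of a question: both sides weighted by their share of the datapoints."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n_total = n_left + n_right
    return (n_left / n_total) * gini(left_counts) \
        + (n_right / n_total) * gini(right_counts)

# Counts are [red, green] on each side of the question
first_split = split_gini([2, 0], [3, 5])    # is feature 0 less than 0?
second_split = split_gini([5, 0], [0, 5])   # is feature 0 less than 1.5?
print(round(first_split, 3), round(second_split, 3))   # 0.375 0.0
```

The second question has the lower gini, so it is the one a tree that minimizes the gini would select.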

Looking at more data.

This question as you can see has more or less divided classification 0 from classification 1 and classification 2.

Then we could ask "is feature 0 less than 1.1?"

From those two questions alone we have:

We can continue branching off to separate more and more of the datapoints at the boundaries.

In this case we will leave all the keyword input arguments at their defaults. Note that because this is an estimator class the syntax is of the same form as the KNeighborsClassifier estimator class.

from sklearn.tree import DecisionTreeClassifier dectree=DecisionTreeClassifier() dectree.fit(X,y) y5_pred=dectree.predict(X5)

Let's use the cross_val_score function to get a 10 fold cross validation mean score.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] X5=X[[33,133],:]+0.1 from sklearn.tree import DecisionTreeClassifier dectree=DecisionTreeClassifier() dectree.fit(X,y) y5_pred=dectree.predict(X5) from sklearn.model_selection import cross_val_score scores=cross_val_score(dectree,X,y,cv=10,scoring='accuracy') score_mean=np.mean(scores) score_std=np.std(scores)

Here we see that score_mean=0.953 which is lower than the KNeighborsClassifier estimator used earlier. y5_pred gives a classification of 0 and 1 respectively. However the last prediction is likely wrong as it was based upon datapoint 133 which is classification 2.
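The questions a fitted tree asks can be printed as text with the function export_text from sklearn.tree. The sketch below limits the tree to two levels of questions (max_depth=2) and fixes random_state purely to keep the output short and repeatable; both settings are assumptions for illustration and differ from the defaults used above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris['data'], iris['target']

# A shallow tree: at most two questions between the root and any leaf
dectree = DecisionTreeClassifier(max_depth=2, random_state=0)
dectree.fit(X, y)

# Each line shows a comparison of one feature with a numeric value
rules = export_text(dectree, feature_names=list(iris['feature_names']))
print(rules)
```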

### The Random Forest Classifier Model of Estimators Class

We just discussed a decision tree where each set of branches was made by minimizing the gini. A random forest consists of multiple decision trees. Different trees are made by differing the ordering of questions, e.g. splitting data by looking at feature 1 instead of feature 0, and subsets of data are used to create different trees. Predictions are submitted by each tree and the final prediction returned to the user is the democratic consensus from the forest. By default the keyword input argument n_estimators=100 meaning 100 estimators are created.

We can use this model with all the keyword input arguments set to default.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_iris iris=load_iris() y=iris['target'] X=iris['data'] X5=X[[33,133],:]+0.1 from sklearn.ensemble import RandomForestClassifier randforest=RandomForestClassifier() randforest.fit(X,y) y5_pred=randforest.predict(X5) from sklearn.model_selection import cross_val_score scores=cross_val_score(randforest,X,y,cv=10,scoring='accuracy') score_mean=np.mean(scores) score_std=np.std(scores)

Here we see that score_mean=0.966 which is lower than the KNeighborsClassifier estimator but higher than a single DecisionTreeClassifier used earlier. y5_pred gives a classification of 0 and 2 respectively which is as expected.
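The individual trees of a fitted forest are available in the attribute estimators_, so the consensus can be inspected by collecting the prediction of each tree. This is a rough sketch; n_estimators and random_state are set explicitly for repeatability, and the names tree_votes and vote_counts are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris['data'], iris['target']
X5 = X[[33, 133], :] + 0.1

randforest = RandomForestClassifier(n_estimators=100, random_state=0)
randforest.fit(X, y)

# One row per tree, one column per new datapoint
tree_votes = np.array([tree.predict(X5) for tree in randforest.estimators_])

# Count how many trees chose each classification for each new datapoint
vote_counts = [np.bincount(tree_votes[:, j].astype(int), minlength=3)
               for j in range(X5.shape[0])]
print(vote_counts)
print(randforest.predict(X5))
```

Strictly speaking, scikit-learn's forest averages the class probabilities predicted by the trees rather than counting one hard vote per tree, so the tally above is an approximation of how the final prediction is reached.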

### The Support Vector Classifier Estimator Class

Let's create a plot of a selection of only species 1 and species 2. We have seen earlier how we can use the concept of a nearest neighbors estimator to estimate which classification an unknown datapoint belongs to. There are also other models. Looking at only classification 1 and classification 2 which are close to one another we can depict the concept of a support vector machine.

The concept is to divide the two species using a boundary line. The boundary line takes into account the nearest datapoints (known as support vectors) and tries to maximize the margin, i.e. the distance between the boundary line and these nearest datapoints, so that the boundary line gives the best separation between the two classifications.

In this case any unknown datapoint to the left of the line is designated as belonging to the red clusters and any unknown datapoint to the right of the line is designated as belonging to the green clusters.

This model has the parameter gamma, which can visually be thought of as the inverse of the width of the blue line. A low value of gamma therefore increases the width of the blue line, meaning more datapoints are used as support vectors.

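C is the regularization parameter, which can be thought of as flexibility: a small C tolerates some misclassified training datapoints in exchange for a simpler boundary, while a large C penalizes them more heavily so the boundary follows the training data more closely. In the case above a straight line is used, which isn't very flexible. Increasing C instead allows the use of a best-fit curve as opposed to a straight line.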
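The effect of gamma can be sketched numerically. SVC uses the radial basis function (RBF) kernel by default, exp(-gamma * ||a - b||^2), which measures how similar two datapoints are. The helper rbf_kernel below is written for illustration only and is not scikit-learn's implementation; the two points are made up.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF similarity between two points: exp(-gamma * squared distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Two hypothetical datapoints that are 2 units apart.
point_a = (0.0, 0.0)
point_b = (2.0, 0.0)

# A small gamma gives a wide reach, a large gamma a narrow one.
for gamma in (0.01, 1.0, 100.0):
    print(gamma, rbf_kernel(point_a, point_b, gamma))
```

With a small gamma, distant points still count as similar, so each datapoint influences a wide region. With a large gamma only very close points do.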

Let's have a look at the support vector machine module. Here we see a number of estimator classes. Let's use the SVC estimator class (Support Vector Classifier).

Now let's instantiate it with open parenthesis so we can see the keyword input arguments.

We can use this estimator with the default values for the keyword arguments; note that the syntax is the same as for the KNeighborsClassifier estimator class.

    from sklearn.svm import SVC
    svc=SVC()
    svc.fit(X_train,y_train)
    y_pred=svc.predict(X_test)

The keyword input arguments of most interest this time are 'C' and 'gamma'. We can create a param_distributions dictionary for these; this time we will create lists of powers of ten using list comprehensions.

    cpowers=[10**c for c in np.arange(start=-5,stop=16,step=1, dtype=float)]
    gpowers=[10**g for g in np.arange(start=-15,stop=3,step=1, dtype=float)]
    param_distributions={'C':cpowers,'gamma':gpowers}
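The same powers of ten can also be generated without a loop. This is a sketch of an alternative using np.logspace, which is not what the code above uses; the names cpowers_alt and gpowers_alt are introduced here for illustration.

```python
import numpy as np

# The list comprehensions used in the text (same exponent ranges).
cpowers = [10**c for c in np.arange(-5, 16, 1, dtype=float)]
gpowers = [10**g for g in np.arange(-15, 3, 1, dtype=float)]

# logspace takes the first and last exponent (both inclusive)
# and the number of points to generate.
cpowers_alt = np.logspace(-5, 15, num=21)
gpowers_alt = np.logspace(-15, 2, num=18)

print(np.allclose(cpowers, cpowers_alt), np.allclose(gpowers, gpowers_alt))
```

Either form can be passed as the values of the param_distributions dictionary.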

Now we can create a model of svc instances with these parameters using the RandomizedSearchCV model class.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris
    iris=load_iris()
    y=iris['target']
    X=iris['data']
    X5=X[[33,133],:]+0.1
    from sklearn.svm import SVC
    svc=SVC()
    cpowers=[10**c for c in np.arange(start=-5,stop=16,step=1, dtype=float)]
    gpowers=[10**g for g in np.arange(start=-15,stop=3,step=1, dtype=float)]
    param_distributions={'C':cpowers, 'gamma':gpowers}
    # try 10 random C/gamma combinations, each scored with 10-fold cross validation
    from sklearn.model_selection import RandomizedSearchCV
    modelsvc=RandomizedSearchCV(svc,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1)
    modelsvc.fit(X,y)
    cv_results=modelsvc.cv_results_
    best_score=modelsvc.best_score_
    best_params=modelsvc.best_params_
    best_estimator=modelsvc.best_estimator_
    # predict with the best estimator found by the search
    y5_pred=modelsvc.predict(X5)

In this case the best_params are C=100 and gamma=0.001, which give a best_score of 0.980. Unsurprisingly y5_pred, obtained by calling predict on the fitted model with X5, once again gives classification 0 for the first datapoint and classification 2 for the second datapoint, as expected.

In this case it iterated through the following 10 values (set by the keyword input argument n_iter=10).

The mean test scores for these parameters were.

### The Logistic Regression Predictor Class

Let's return to the plot where we used PCA to reduce the number of dimensions to 1 and where we plotted the classification on the y-axis.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris
    iris=load_iris()
    y=iris['target']
    X=iris['data']
    # reduce the 4 features to a single principal component
    from sklearn.decomposition import PCA
    pca=PCA(n_components=1)
    x2=pca.fit_transform(X)
    fig9,ax9=plt.subplots()
    ax9.scatter(x2[:,0][y==0],np.zeros(50),color='b')
    ax9.scatter(x2[:,0][y==1],np.ones(50),color='r')
    ax9.scatter(x2[:,0][y==2],2*np.ones(50),color='g')
    ax9.set_xlabel('feature 0')
    ax9.set_ylabel('classification')

Now for simplicity, let's look at just two of the classifications 0 and 1 (blue and red).

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris
    iris=load_iris()
    y=iris['target']
    X=iris['data']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=1)
    # the first 100 observations are classifications 0 and 1
    y2=y[:100]
    x2=pca.fit_transform(X)[:100]
    fig9,ax9=plt.subplots()
    ax9.scatter(x2[:,0][y2==0],np.zeros(50),color='b')
    ax9.scatter(x2[:,0][y2==1],np.ones(50),color='r')
    ax9.set_xlabel('feature 0')
    ax9.set_ylabel('classification')

We now want a mathematical expression to pass through all of these data points.

The linear_model module has a number of estimators and predictors that can be used to fit this kind of data. We can try using LinearRegression first.

We can select a linear fit.

    from sklearn.linear_model import LinearRegression
    linreg=LinearRegression()
    linreg.fit(x2,y2)
    # evenly spaced x values spanning the data, as a column vector
    x3=np.reshape(np.arange(start=-3.2,stop=1.6,step=0.1),(-1,1))
    y3_pred_lin=linreg.predict(x3)

We can then plot this.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris
    iris=load_iris()
    y=iris['target']
    X=iris['data']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=1)
    y2=y[:100]
    x2=pca.fit_transform(X)[:100]
    from sklearn.linear_model import LinearRegression
    linreg=LinearRegression()
    linreg.fit(x2,y2)
    x3=np.reshape(np.arange(start=-3.2,stop=1.6,step=0.1),(-1,1))
    y3_pred_lin=linreg.predict(x3)
    fig9,ax9=plt.subplots()
    ax9.scatter(x2[:,0][y2==0],np.zeros(50),color='b')
    ax9.scatter(x2[:,0][y2==1],np.ones(50),color='r')
    # the fitted straight line
    ax9.plot(x3,y3_pred_lin,color='k')
    ax9.set_xlabel('feature 0')
    ax9.set_ylabel('classification')

Here we see the model has fitted a straight line between the datapoints; however, the straight line is continuous, and values such as -0.2 and 1.2 don't make sense when it comes to a classification problem.

We instead want a mathematical expression that starts at 0, stays at 0 for a while and then rapidly rises to 1 and stays at 1. A LogisticRegression is therefore likely to be more suitable.
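The S-shaped curve described here is the logistic (sigmoid) function, 1/(1+e^(-z)). A minimal sketch, where the helper name sigmoid and the sample inputs are chosen for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Far to the left the output is close to 0, far to the right it is
# close to 1, and at z = 0 it is exactly 0.5.
for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 4))
```

Logistic regression fits the position and steepness of this curve to the data, and the output is interpreted as the probability of belonging to classification 1.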

    from sklearn.linear_model import LogisticRegression
    logreg=LogisticRegression()
    logreg.fit(x2,y2)
    x3=np.reshape(np.arange(start=-3.2,stop=1.6,step=0.1),(-1,1))
    y3_pred_log=logreg.predict(x3)

We can then plot this.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris
    iris=load_iris()
    y=iris['target']
    X=iris['data']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=1)
    y2=y[:100]
    x2=pca.fit_transform(X)[:100]
    from sklearn.linear_model import LogisticRegression
    logreg=LogisticRegression()
    logreg.fit(x2,y2)
    x3=np.reshape(np.arange(start=-3.2,stop=1.6,step=0.1),(-1,1))
    y3_pred_log=logreg.predict(x3)
    fig9,ax9=plt.subplots()
    ax9.scatter(x2[:,0][y2==0],np.zeros(50),color='b')
    ax9.scatter(x2[:,0][y2==1],np.ones(50),color='r')
    # the predicted classification steps from 0 to 1
    ax9.plot(x3,y3_pred_log,color='k')
    ax9.set_xlabel('feature 0')
    ax9.set_ylabel('classification')

LogisticRegression is a predictor class and also has the method predict_proba which will give the probability for each classification.

y3_pred_proba_log=logreg.predict_proba(x3)

We can display this in the variable explorer.

We can plot this.

    fig9,ax9=plt.subplots()
    # column 0 is the probability of classification 0, column 1 of classification 1
    ax9.plot(x3,y3_pred_proba_log[:,0],color='b')
    ax9.plot(x3,y3_pred_proba_log[:,1],color='r')
    ax9.set_xlabel('feature 0')
    ax9.set_ylabel('probability')
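The relationship between predict_proba and predict can be checked directly. The sketch below repeats the two-classification setup used above and evaluates the model on the training points themselves; the variable names proba and pred are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris['data']
y = iris['target']

# One principal component, classifications 0 and 1 only.
x2 = PCA(n_components=1).fit_transform(X)[:100]
y2 = y[:100]

logreg = LogisticRegression()
logreg.fit(x2, y2)

proba = logreg.predict_proba(x2)   # one column per classification
pred = logreg.predict(x2)

# Each row of probabilities sums to 1 ...
print(np.allclose(proba.sum(axis=1), 1.0))
# ... and predict returns the classification with the larger probability.
print(np.array_equal(pred, logreg.classes_[np.argmax(proba, axis=1)]))
```

This is why the two probability curves mirror one another: for two classifications, one probability is always one minus the other.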

Let's return to having the three classifications. Let's then plot the original data, the probabilities calculated and the final classifications.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris
    iris=load_iris()
    y=iris['target']
    X=iris['data']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=1)
    x2=pca.fit_transform(X)
    from sklearn.linear_model import LogisticRegression
    logreg=LogisticRegression()
    logreg.fit(x2,y)
    x3=np.reshape(np.arange(start=-3.2,stop=4.2,step=0.1),(-1,1))
    y3_pred_log=logreg.predict(x3)
    y3_pred_proba_log=logreg.predict_proba(x3)
    fig9,axes=plt.subplots(nrows=3,ncols=1)
    # top: original data
    axes[0].scatter(x2[:,0][y==0],np.zeros(50),color='b')
    axes[0].scatter(x2[:,0][y==1],np.ones(50),color='r')
    axes[0].scatter(x2[:,0][y==2],2*np.ones(50),color='g')
    axes[0].set_xlabel('feature 0')
    axes[0].set_ylabel('classification')
    # middle: probability of each classification
    axes[1].plot(x3,y3_pred_proba_log[:,0],color='b')
    axes[1].plot(x3,y3_pred_proba_log[:,1],color='r')
    axes[1].plot(x3,y3_pred_proba_log[:,2],color='g')
    axes[1].set_xlabel('feature 0')
    axes[1].set_ylabel('probability')
    # bottom: final predicted classification
    axes[2].scatter(x3,y3_pred_log,color='k')
    axes[2].set_xlabel('feature 0')
    axes[2].set_ylabel('predicted classification')

We used 1D data to visualize how logistic regression works; however, we can use all features when fitting the model.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    from sklearn.datasets import load_iris
    iris=load_iris()
    y=iris['target']
    X=iris['data']
    X5=X[[33,133],:]+0.1
    from sklearn.linear_model import LogisticRegression
    logreg=LogisticRegression()
    cpowers=[10**c for c in np.arange(start=0,stop=11,step=1, dtype=float)]
    param_distributions={'C':cpowers,'solver':['liblinear'], 'random_state':[1]}
    from sklearn.model_selection import RandomizedSearchCV
    logregmodel=RandomizedSearchCV(logreg,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1)
    logregmodel.fit(X,y)
    cv_results=logregmodel.cv_results_
    best_score=logregmodel.best_score_
    best_params=logregmodel.best_params_
    best_estimator=logregmodel.best_estimator_
    # predict with the best estimator found by the search
    y5_pred=logregmodel.predict(X5)

In this case the best_params are C=100, which gives a best_score of 0.980. Unsurprisingly y5_pred, obtained by calling predict on the fitted model with X5, once again gives classification 0 for the first datapoint and classification 2 for the second datapoint, as expected.

## The Wine Example Bunch (13 Features and Target Classification)

The wine dataset measures a number of different chemical features of wines and classifies each wine by the cultivar (grower) it came from. Let's load the dataset.

    from sklearn.datasets import load_wine
    wine=load_wine()

Now let's explore it within the variable explorer.

We can get a description of the dataset by opening up DESCR.

Looking at the DESCR, we see that we have 13 features and 178 observations. This is a classification problem and we have 3 discrete categories.

Let's have a look at the numerical data and assign it to X.

X=wine['data']

The numeric classification of each observation is contained within the vector target, which we can assign to y.

y=wine['target']

### The Principle Component Analysis Transformer Class

In order to visualize this data we can once again use the PCA transformer class to reduce the number of dimensions to 3 which we can plot.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    # numpy's random module (the scipy alias is not available in recent releases)
    from numpy import random
    random.seed(4)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=3)
    X2=pca.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111, projection='3d')
    colors=['r','g','b']
    # one scatter call per classification so each gets its own color
    for i in range(0,3,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], X2[:,2][y==i], s=5,color=colors[i])
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.set_zlabel('feature 2')
    ax1.legend(['0','1','2'], loc='upper left')

Here we can see that we have 3 distinct clusters. The fact that they appear distinct is a good sign, as it gives us a starting point to visually classify them.

We can also try using 2 dimensions:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=2)
    X2=pca.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111)
    colors=['r','g','b']
    for i in range(0,3,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], s=5,color=colors[i])
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.legend(['0','1','2'], loc='lower left')

### The Isomap Transformer Class

We can also try dimensionality reduction with the Isomap transformer class.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(4)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.manifold import Isomap
    iso=Isomap(n_components=3)
    X2=iso.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111, projection='3d')
    colors=['r','g','b']
    for i in range(0,3,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], X2[:,2][y==i], s=5,color=colors[i])
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.set_zlabel('feature 2')
    ax1.legend(['0','1','2'], loc='upper left')

We can also try using 2 dimensions:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.manifold import Isomap
    iso=Isomap(n_components=2)
    X2=iso.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111)
    colors=['r','g','b']
    for i in range(0,3,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], s=5,color=colors[i])
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.legend(['0','1','2'], loc='lower left')

### The K Nearest Neighbors Classifier Estimator Class

We can try to split the data up into a testing and training dataset to use with the KNeighborsClassifier estimator class, starting with n_neighbors=1. We can then calculate an accuracy score and confusion matrix.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    # hold back 25 % of the observations for testing
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1)
    from sklearn.neighbors import KNeighborsClassifier
    knn=KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train,y_train)
    y_pred=knn.predict(X_test)
    from sklearn.metrics import accuracy_score
    acc_score=accuracy_score(y_test,y_pred)
    from sklearn.metrics import confusion_matrix
    con_mat=confusion_matrix(y_test,y_pred)

Here acc_score=0.711, which indicates a number of misclassifications. Let's examine the confusion matrix.

As demonstrated earlier, the acc_score from a single train/test split has a high variance, so we can once again perform a randomized search with 10-fold cross-validation to find the best parameters. These were found to be n_neighbors=20 and weights='distance', with a best_score=0.748, which is slightly more accurate but still implies a large number of misclassifications.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.neighbors import KNeighborsClassifier
    knn=KNeighborsClassifier()
    param_distributions={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform','distance']}
    from sklearn.model_selection import RandomizedSearchCV
    modelknn=RandomizedSearchCV(knn,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1)
    modelknn.fit(X,y)
    cv_results=modelknn.cv_results_
    best_score=modelknn.best_score_
    best_params=modelknn.best_params_
    best_estimator=modelknn.best_estimator_
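One possible reason for the low KNeighborsClassifier score is that the wine features are measured on very different scales, and a distance-based estimator is dominated by the features with the largest values. The sketch below is an assumption-driven experiment rather than part of the original walkthrough: it chains the StandardScaler transformer and the estimator with make_pipeline so that the scaling is fitted only on the training folds. No score is quoted here because this run was not part of the original results.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine['data']
y = wine['target']

# Unscaled baseline with the default n_neighbors=5.
knn = KNeighborsClassifier()
scores_raw = cross_val_score(knn, X, y, cv=10, scoring='accuracy')

# Standardize each feature to zero mean and unit variance before KNN.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores_scaled = cross_val_score(scaled_knn, X, y, cv=10, scoring='accuracy')

print(np.mean(scores_raw), np.mean(scores_scaled))
```

Comparing the two means shows how much of the gap to the other estimators is down to feature scaling rather than to the nearest neighbors approach itself.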

### The Naive Bayes Classifier Estimator Class

We can try using the Naive Bayes Estimator Class using 10 fold cross validation.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.naive_bayes import GaussianNB
    gaussiannb=GaussianNB()
    gaussiannb.fit(X,y)
    # 10-fold cross validation
    from sklearn.model_selection import cross_val_score
    scores=cross_val_score(gaussiannb,X,y,cv=10,scoring='accuracy')
    score_mean=np.mean(scores)
    score_std=np.std(scores)

Here score_mean=0.978, suggesting this is a substantially better model to use for this dataset. Let's split the data into training and testing data and compute a confusion matrix.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1)
    from sklearn.naive_bayes import GaussianNB
    gaussiannb=GaussianNB()
    gaussiannb.fit(X_train,y_train)
    y_pred=gaussiannb.predict(X_test)
    from sklearn.metrics import accuracy_score
    acc_score=accuracy_score(y_test,y_pred)
    from sklearn.metrics import confusion_matrix
    con_mat=confusion_matrix(y_test,y_pred)

We can see that all the test observations are correctly classified and that the GaussianNB estimator works much better than the KNeighborsClassifier for this data.

### The Random Forest Classifier Model of Estimator Class

We can try the Random Forests Classifier Model of Estimator Class.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.ensemble import RandomForestClassifier
    randforest=RandomForestClassifier()
    randforest.fit(X,y)
    from sklearn.model_selection import cross_val_score
    scores=cross_val_score(randforest,X,y,cv=10,scoring='accuracy')
    score_mean=np.mean(scores)
    score_std=np.std(scores)

The score_mean=0.977 giving very similar performance to the GaussianNB estimator.

### The Support Vector Classifier Estimator Class

We can try the SVC Estimator Class.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.svm import SVC
    svc=SVC()
    cpowers=[10**c for c in np.arange(start=-5,stop=16,step=1, dtype=float)]
    gpowers=[10**g for g in np.arange(start=-15,stop=3,step=1, dtype=float)]
    param_distributions={'C':cpowers, 'gamma':gpowers}
    from sklearn.model_selection import RandomizedSearchCV
    modelsvc=RandomizedSearchCV(svc,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1)
    modelsvc.fit(X,y)
    cv_results=modelsvc.cv_results_
    best_score=modelsvc.best_score_
    best_params=modelsvc.best_params_
    best_estimator=modelsvc.best_estimator_

Here best_score=0.921, which is better than the KNeighborsClassifier estimator but worse than the GaussianNB and RandomForestClassifier estimators.

The best_params were C=10000000 and gamma=1e-9.

### The Logistic Regression Predictor Class

We can now try the Logistic Regression Predictor Class:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_wine
    wine=load_wine()
    X=wine['data']
    y=wine['target']
    from sklearn.linear_model import LogisticRegression
    logreg=LogisticRegression()
    cpowers=[10**c for c in np.arange(start=0,stop=11,step=1, dtype=float)]
    param_distributions={'C':cpowers,'solver':['liblinear'], 'random_state':[1]}
    from sklearn.model_selection import RandomizedSearchCV
    logregmodel=RandomizedSearchCV(logreg,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1)
    logregmodel.fit(X,y)
    cv_results=logregmodel.cv_results_
    best_score=logregmodel.best_score_
    best_params=logregmodel.best_params_
    best_estimator=logregmodel.best_estimator_

Here best_score=0.956, which is better than the KNeighborsClassifier and SVC estimators but worse than the GaussianNB and RandomForestClassifier estimators.

Out of the estimators tested the GaussianNB estimator performs the best for this dataset.

## The Digits Example Bunch (64 Features and Target Classification)

The digits dataset is a practice dataset for Optical Character Recognition (OCR) using machine learning. In this problem we want to classify an image of a handwritten digit as a numeric digit. Let's load the dataset.

    from sklearn.datasets import load_digits
    digits=load_digits()

Now let's explore it within the variable explorer.

We can get a description of the dataset by opening up DESCR.

Looking at the DESCR, we see that we have 64 features. Although the DESCR states 5620 observations, we only get 1797. This is a classification problem and we have 10 discrete categories.

The 3D array images consists of 1797 grey-scale images that are 8 pixels by 8 pixels. The shape is (1797,8,8), or 1797 pages by 8 rows by 8 columns. We can open it within the variable explorer. By default we scroll through axis 0, which corresponds to the page, i.e. each individual image. Notice that the intensities are small integers (the DESCR gives the range as 0 to 16), so this is a low bit-depth grey-scale image with only a handful of grey levels. The background color in the variable explorer gives an indication of the image.

With matplotlib set to automatic, we can display one of these as an image using imshow and the bone color map.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    images=digits['images']
    # show image 0 with light text on a dark background
    plt.imshow(images[0,:,:],cmap='bone')

We can use the reverse colormap bone_r to instead have dark text and a light background.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    images=digits['images']
    # reversed color map: dark text on a light background
    plt.imshow(images[0,:,:],cmap='bone_r')

We can use a for loop to index over a series of images and pause for a second giving us time to view the images.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    images=digits['images']
    # show the first 10 images, pausing 1 second on each
    for i in range(10):
        plt.imshow(images[i,:,:],cmap='bone_r')
        plt.pause(1)

In this case we (as humans) can clearly recognize or classify the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.

Their numeric value is contained within the vector target, which we can assign to y.

y=digits['target']

Each image has 8 rows by 8 columns, or 64 pixels, and each pixel is a feature.

We can create 64 features for image 0 by flattening the matrix. This appends each row to the previous one. We can then explicitly specify a row vector using the function reshape:

    img0=images[0,:,:].flatten()
    img0=np.reshape(img0,(1,-1))

We can use the function reshape to reshape the entire 3 dimensional array into 2 dimensions.

    data=np.reshape(images,(1797,64))

This is of course the same as the data matrix within the digits bunch and this is our numeric matrix of features X.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
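The order of the dimensions matters when reshaping a stack of images into a feature matrix. The sketch below uses a small made-up stack of 5 images instead of the digits data, so the array contents and the names data and wrong are illustrative only.

```python
import numpy as np

# Hypothetical stack of 5 tiny "images", 8 rows by 8 columns each.
images = np.arange(5 * 8 * 8).reshape(5, 8, 8)

# Keep one row per image and let -1 infer the 64 feature columns.
data = images.reshape(len(images), -1)
print(data.shape)

# Row 0 of the 2D array is image 0 with its rows appended one after another.
print(np.array_equal(data[0], images[0].flatten()))

# Reshaping to (64, 5) instead runs without error but scrambles the pixels:
# column 0 is no longer image 0.
wrong = images.reshape(64, 5)
print(np.array_equal(wrong[:, 0], images[0].flatten()))
```

The feature matrix therefore needs one row per observation and one column per pixel, which is the layout scikit-learn estimators expect for X.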

### The Principle Component Analysis Transformer Class

In order to visualize this data we can once again use the PCA transformer class to reduce the number of dimensions to 3 which we can plot.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    # numpy's random module (the scipy alias is not available in recent releases)
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=3)
    X2=pca.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111, projection='3d')
    # one scatter call per digit, each with a random RGB color
    for i in range(0,10,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], X2[:,2][y==i], s=0.5,color=random.rand(3))
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.set_zlabel('feature 2')
    ax1.legend(['0','1','2','3','4','5','6','7','8','9'], loc='upper left')

Here we can see that we have 10 clusters, one for each digit. The fact that they appear distinct is a good sign, as it gives us a starting point to visually classify them.

We can also try using 2 dimensions:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.decomposition import PCA
    pca=PCA(n_components=2)
    X2=pca.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111)
    for i in range(0,10,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], s=0.5,color=random.rand(3))
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.legend(['0','1','2','3','4','5','6','7','8','9'], loc='lower left')

### The Isomap Transformer Class

We can also try dimensionality reduction with the Isomap transformer class.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.manifold import Isomap
    iso=Isomap(n_components=3)
    X2=iso.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111, projection='3d')
    for i in range(0,10,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], X2[:,2][y==i], s=0.5,color=random.rand(3))
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.set_zlabel('feature 2')
    ax1.legend(['0','1','2','3','4','5','6','7','8','9'], loc='upper left')

We can see that the data is easier to visually distinguish using Isomap compared to PCA.

We can also try using 2 dimensions:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from numpy import random
    random.seed(17)
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.manifold import Isomap
    iso=Isomap(n_components=2)
    X2=iso.fit_transform(X)
    fig1=plt.figure()
    ax1=fig1.add_subplot(111)
    for i in range(0,10,1):
        ax1.scatter(X2[:,0][y==i], X2[:,1][y==i], s=0.5,color=random.rand(3))
    ax1.set_xlabel('feature 0')
    ax1.set_ylabel('feature 1')
    ax1.legend(['0','1','2','3','4','5','6','7','8','9'], loc='lower left')

### The K Nearest Neighbors Classifier Estimator Class

We can try to split the data up into a testing and training dataset to use with the KNeighborsClassifier estimator class, starting with n_neighbors=1. We can then calculate an accuracy score and confusion matrix.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    # hold back 25 % of the observations for testing
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1)
    from sklearn.neighbors import KNeighborsClassifier
    knn=KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train,y_train)
    y_pred=knn.predict(X_test)
    from sklearn.metrics import accuracy_score
    acc_score=accuracy_score(y_test,y_pred)
    from sklearn.metrics import confusion_matrix
    con_mat=confusion_matrix(y_test,y_pred)

Here acc_score=0.988, suggesting that we were able to classify most of the written digits correctly. Examining the confusion matrix, we see that most of the values lie on the diagonal and so were correctly classified; however, there are a handful of misclassifications.

A 9 was misclassified as a 3, a 2 was misclassified as a 7, a 7 was misclassified as a 9 and two 5s were misclassified as 9s.
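To read a confusion matrix it helps to build a small one by hand. In the sketch below the labels are made up for illustration; the layout (rows for the true classification, columns for the predicted one) follows the convention used by scikit-learn's confusion_matrix.

```python
# Hypothetical true and predicted labels for eight test samples.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_hat = [0, 1, 2, 1, 1, 0, 2, 2]

n_classes = 3
# Rows are the true classification, columns the predicted classification.
con_mat = [[0] * n_classes for _ in range(n_classes)]
for true_label, predicted_label in zip(y_true, y_hat):
    con_mat[true_label][predicted_label] += 1

# Correct predictions lie on the diagonal.
correct = sum(con_mat[i][i] for i in range(n_classes))
accuracy = correct / len(y_true)

for row in con_mat:
    print(row)
print(accuracy)
```

Every off-diagonal entry is a misclassification: the entry in row 2, column 1 counts samples whose true classification is 2 but which were predicted as 1.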

As demonstrated earlier, the acc_score from a single train/test split has a high variance, so we can once again perform a randomized search with 10-fold cross-validation to find the best parameters. These were found to be n_neighbors=2 and weights='uniform', with a best_score=0.974.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.neighbors import KNeighborsClassifier
    knn=KNeighborsClassifier()
    param_distributions={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform','distance']}
    from sklearn.model_selection import RandomizedSearchCV
    modelknn=RandomizedSearchCV(knn,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1)
    modelknn.fit(X,y)
    cv_results=modelknn.cv_results_
    best_score=modelknn.best_score_
    best_params=modelknn.best_params_
    best_estimator=modelknn.best_estimator_

### The Naive Bayes Classifier Estimator Class

We can try using the Naive Bayes Estimator Class using 10 fold cross validation.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.naive_bayes import GaussianNB
    gaussiannb=GaussianNB()
    gaussiannb.fit(X,y)
    from sklearn.model_selection import cross_val_score
    scores=cross_val_score(gaussiannb,X,y,cv=10,scoring='accuracy')
    score_mean=np.mean(scores)
    score_std=np.std(scores)

Here score_mean=0.811, suggesting this is a substantially worse model to use for this dataset. Out of curiosity, we can compute a confusion matrix to get a feeling for where the estimator has misclassified the test data.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1)
    from sklearn.naive_bayes import GaussianNB
    gaussiannb=GaussianNB()
    gaussiannb.fit(X_train,y_train)
    y_pred=gaussiannb.predict(X_test)
    from sklearn.metrics import accuracy_score
    acc_score=accuracy_score(y_test,y_pred)
    from sklearn.metrics import confusion_matrix
    con_mat=confusion_matrix(y_test,y_pred)

### The Random Forest Classifier Model of Estimator Class

We can try the Random Forests Classifier Model of Estimator Class.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.ensemble import RandomForestClassifier
    randforest=RandomForestClassifier()
    randforest.fit(X,y)
    from sklearn.model_selection import cross_val_score
    scores=cross_val_score(randforest,X,y,cv=10,scoring='accuracy')
    score_mean=np.mean(scores)
    score_std=np.std(scores)

Here score_mean=0.951, which is better than the Naive Bayes Classifier Estimator Class but still slightly worse than the K Nearest Neighbors Classifier Estimator Class.

### The Support Vector Classifier Estimator Class

We can try the SVC Estimator Class.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d
    import pandas as pd
    from sklearn.datasets import load_digits
    digits=load_digits()
    y=digits['target']
    X=digits['data']
    from sklearn.svm import SVC
    svc=SVC()
    cpowers=[10**c for c in np.arange(start=-5,stop=16,step=1, dtype=float)]
    gpowers=[10**g for g in np.arange(start=-15,stop=3,step=1, dtype=float)]
    param_distributions={'C':cpowers, 'gamma':gpowers}
    from sklearn.model_selection import RandomizedSearchCV
    modelsvc=RandomizedSearchCV(svc,param_distributions, n_iter=10,scoring='accuracy', cv=10,random_state=1)
    modelsvc.fit(X,y)
    cv_results=modelsvc.cv_results_
    best_score=modelsvc.best_score_
    best_params=modelsvc.best_params_
    best_estimator=modelsvc.best_estimator_

The best_score=0.981, which is the best score so far. The best_params were C=100 and gamma=0.001. Let's split our data into training and testing data and create a confusion matrix with these parameters.
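The two search grids used above are powers of ten built with list comprehensions. As a minimal sketch, the same values could also be generated with np.logspace; this function is not used in the original and is shown here only as an assumed equivalent.

```python
import numpy as np

# Grids as built in the text: 10**-5 ... 10**15 for C and 10**-15 ... 10**2 for gamma.
cpowers = [10**c for c in np.arange(start=-5, stop=16, step=1, dtype=float)]
gpowers = [10**g for g in np.arange(start=-15, stop=3, step=1, dtype=float)]

# Equivalent grids with np.logspace (both exponents are inclusive endpoints).
cgrid = np.logspace(-5, 15, num=21)
ggrid = np.logspace(-15, 2, num=18)

print(len(cpowers), len(gpowers))
print(np.allclose(cpowers, cgrid), np.allclose(gpowers, ggrid))
```

With 21 values of C and 18 values of gamma there are 378 possible combinations, so a randomized search with n_iter=10 samples only a small fraction of the grid.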

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_digits digits=load_digits() y=digits['target'] X=digits['data'] from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1) from sklearn.svm import SVC svc=SVC(C=100,gamma=0.001) svc.fit(X_train,y_train) y_pred=svc.predict(X_test) from sklearn.metrics import accuracy_score acc_score=accuracy_score(y_test,y_pred) from sklearn.metrics import confusion_matrix con_mat=confusion_matrix(y_test,y_pred)

We see slightly fewer misclassifications, as expected.

### The Logistic Regression Predictor Class

We can now try the Logistic Regression Predictor Class:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    # Load the digits bunch and separate the features from the target.
    digits=load_digits()
    y=digits['target']
    X=digits['data']

    logreg=LogisticRegression()
    # Candidate values of C: 10**0 up to 10**10.
    cpowers=[10**c for c in np.arange(start=0,stop=11,step=1,dtype=float)]
    param_distributions={'C':cpowers,'solver':['liblinear'],'random_state':[1]}

    # Randomized search with 10 fold cross-validation, scored on accuracy.
    logregmodel=RandomizedSearchCV(logreg,param_distributions,
                                   n_iter=10,scoring='accuracy',
                                   cv=10,random_state=1)
    logregmodel.fit(X,y)
    cv_results=logregmodel.cv_results_
    best_score=logregmodel.best_score_
    best_params=logregmodel.best_params_
    best_estimator=logregmodel.best_estimator_

We get a convergence warning when using this estimator with this dataset. The warning means the solver reached its iteration limit before converging, so the scores obtained with these settings should be treated with caution. Logistic Regression is most naturally suited to a binary target (0 or 1).
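A convergence warning is often a sign that the features would benefit from scaling or that the solver needs more iterations. The following is a minimal sketch under those assumptions: the pipeline, the use of the Standard Scaler here and max_iter=1000 are not part of the original code, and the resulting score is not quoted in the original text.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = digits['data']
y = digits['target']

# Assumption: scale the pixel features and give the solver more iterations.
scaled_logreg = make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000))

# 10 fold cross-validation scored on accuracy, as in the earlier examples.
scores = cross_val_score(scaled_logreg, X, y, cv=10, scoring='accuracy')
score_mean = np.mean(scores)
score_std = np.std(scores)
print(score_mean, score_std)
```

Placing the scaler inside the pipeline means it is refitted on the training folds only, so no information from the held-out fold leaks into the scaling.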

## The Breast Cancer Example Bunch (30 Features and Binary Target Classification)

The breast cancer dataset is a practice dataset taken from the medical diagnostics field. In this problem we have a number of features and a binary target (the tumor is either benign or malignant). Let's import the function from the datasets module and call the function to assign the bunch to an object name.

    from sklearn.datasets import load_breast_cancer
    breast_cancer=load_breast_cancer()

Let's open the breast_cancer bunch within the variable explorer:

Then have a look at the DESCR.

We get details about all 30 features. Unless you are a medical diagnostics expert, you can get lost reading the human description of all the features. However, as data scientists we will treat all these features as numeric data and disregard their physical meaning.

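The description also tells us that there are two classes, malignant (cancer) and benign (no cancer). Note that scikit-learn encodes these in the order given by target_names, which is ['malignant', 'benign'], so in the raw target 0 is malignant and 1 is benign. The walkthrough below treats the class labelled 1 as the Positive class; to score malignant as the Positive class with this encoding, pass pos_label=0 to the metric functions or flip the labels with y=1-y.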
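The integer encoding of the two classes is easy to get backwards, so it is worth reading it from the bunch itself rather than from memory. A minimal sketch that lists each integer label with its class name and sample count:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
y = breast_cancer['target']

# target_names[i] is the class name for the integer label i.
target_names = [str(name) for name in breast_cancer['target_names']]
counts = np.bincount(y)

for label, name in enumerate(target_names):
    print(label, name, counts[label])
```

This matters when interpreting Positives and Negatives later, because recall_score and precision_score treat label 1 as the Positive class by default (pos_label=1).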

Ignoring the physical definitions of all the features, let's look only at the data and target. We see these are all numeric.

Let's now assign these to X and y respectively.

    from sklearn.datasets import load_breast_cancer
    breast_cancer=load_breast_cancer()
    X=breast_cancer['data']
    y=breast_cancer['target']

The number of features in this case is too large to view in a pair plot. However, we can plot the correlation coefficients between them in the form of a heatmap.

For this we need to combine X and y into a pandas dataframe. Then we can use the dataframe method corr to get the correlation coefficient and the seaborn function heatmap.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import seaborn as sns from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] names=list(breast_cancer['feature_names'])+['target'] alldata=np.concatenate([X,np.reshape(y,(-1,1))],axis=1) alldatadf=pd.DataFrame(alldata,columns=names) correlation=alldatadf.corr() sns.heatmap(correlation,annot=True,fmt='.0%')

Since the target is what we are interested in, we will examine the target row for signs of correlation. The features whose correlation with the target has a low magnitude (1%, 1%, 7%, 1% and 8%) are likely to be of the least interest.

Let's mask these off. The next thing to notice is that some values off the diagonal show 95-100 % cross-correlation between features; there are 5 such values in the case of mean radius.

This means these features might be telling us duplicate information. If we also mask these off, we may only have the following information (in terms of rows) to classify our problem.
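The two masking steps can also be carried out numerically instead of by eye. This is a minimal sketch: the thresholds of 0.10 and 0.95 are assumptions chosen to mirror the percentages quoted above, and np.corrcoef is used in place of the dataframe corr method (both compute Pearson correlation coefficients).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X = breast_cancer['data']
y = breast_cancer['target']
feature_names = [str(name) for name in breast_cancer['feature_names']]

# Correlation matrix of the 30 features plus the target (last row and column).
alldata = np.concatenate([X, np.reshape(y, (-1, 1))], axis=1)
correlation = np.corrcoef(alldata, rowvar=False)

# Features only weakly correlated with the target (assumed threshold 0.10).
target_corr = correlation[:-1, -1]
weak = [feature_names[i] for i in np.where(np.abs(target_corr) < 0.10)[0]]

# Feature pairs with very high cross-correlation (assumed threshold 0.95).
# np.triu with k=1 keeps each pair once and skips the diagonal.
feature_corr = np.abs(correlation[:-1, :-1])
rows, cols = np.where(np.triu(feature_corr, k=1) >= 0.95)
pairs = [(feature_names[i], feature_names[j]) for i, j in zip(rows, cols)]

print(weak)
print(pairs)
```

The lists give candidate features to drop; which member of a highly correlated pair to keep remains a judgment call.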

### The Principle Component Analysis Transformer Class

As this is a classification problem let's try reducing the number of dimensions so we can visualize our data.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.decomposition import PCA pca=PCA(n_components=3) X2=pca.fit_transform(X) fig3=plt.figure() ax3=fig3.add_subplot(111, projection='3d') ax3.scatter(X2[:,0][y==0], X2[:,1][y==0], X2[:,2][y==0], color='b') ax3.scatter(X2[:,0][y==1], X2[:,1][y==1], X2[:,2][y==1], color='r') ax3.set_xlabel('feature 0') ax3.set_ylabel('feature 1') ax3.set_zlabel('feature 2')

We see two clusters here and most of the values are well separated, which is a good sign for this classification problem. Note however that the clusters still overlap substantially where they meet, which means there is a high chance of misclassification for points in that region, but we may be able to distinguish these better in higher dimensions.
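How much of the original information survives the reduction to 3 dimensions can be gauged from the explained variance ratio of the fitted transformer. A minimal sketch using the same PCA settings as above (the attribute explained_variance_ratio_ is not inspected in the original):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

breast_cancer = load_breast_cancer()
X = breast_cancer['data']

pca = PCA(n_components=3)
X2 = pca.fit_transform(X)

# Fraction of the total variance captured by each of the 3 components.
evr = pca.explained_variance_ratio_
print(evr)
print(evr.sum())
```

Because the features are not scaled here, the components are dominated by the features with the largest numeric range; applying the Standard Scaler before PCA is a common alternative when the features use very different units.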

We can also reduce the number of dimensions down to 2.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) fig4,ax4=plt.subplots() ax4.scatter(X2[:,0][y==0],X2[:,1][y==0],color='b',alpha=0.2) ax4.scatter(X2[:,0][y==1],X2[:,1][y==1],color='r',alpha=0.2) ax4.set_xlabel('feature 0') ax4.set_ylabel('feature 1')

### The Isomap Transformer Class

Let's try using the Isomap Transformer Class to create a 3D plot and a 2D plot.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.manifold import Isomap iso=Isomap(n_components=3) X2=iso.fit_transform(X) fig3=plt.figure() ax3=fig3.add_subplot(111, projection='3d') ax3.scatter(X2[:,0][y==0], X2[:,1][y==0], X2[:,2][y==0], color='b',alpha=0.2) ax3.scatter(X2[:,0][y==1], X2[:,1][y==1], X2[:,2][y==1], color='r',alpha=0.2) ax3.set_xlabel('feature 0') ax3.set_ylabel('feature 1') ax3.set_zlabel('feature 2')

We see some slightly better separation but there is still a substantial overlap.

import numpy as np import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.manifold import Isomap iso=Isomap(n_components=2) X2=iso.fit_transform(X) fig4,ax4=plt.subplots() ax4.scatter(X2[:,0][y==0],X2[:,1][y==0],color='b',alpha=0.2) ax4.scatter(X2[:,0][y==1],X2[:,1][y==1],color='r',alpha=0.2) ax4.set_xlabel('feature 0') ax4.set_ylabel('feature 1')

### The K Nearest Neighbors Classifier Estimator Class

We can try to split the data up into a testing and training dataset to use with the KNeighborsClassifier estimator class, starting with n_neighbors=1. We can then calculate an accuracy score and confusion matrix.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier(n_neighbors=1) knn.fit(X_train,y_train) y_pred=knn.predict(X_test) from sklearn.metrics import accuracy_score acc_score=accuracy_score(y_test,y_pred) from sklearn.metrics import confusion_matrix con_mat=confusion_matrix(y_test,y_pred)

The acc_score=0.923, suggesting that we were mostly able to distinguish class 0 from class 1.

Let's examine the confusion matrix. We see that most of the values lie on the diagonal and so were correctly classified; however, we can see a handful of misclassifications.

- We have 48 Negatives that were tested Negative. These are known as True Negatives TN.
- We also have 84 Positives that were tested Positive. These are known as True Positives TP.
- We have however got 4 Positives that were tested Negative. These are known as False Negatives FN.
- We have 7 Negatives that were tested Positive. These are known as False Positives FP.
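The four counts can be read straight out of the confusion matrix. The sketch below rebuilds the matrix from the counts listed above, assuming scikit-learn's layout (rows are the actual classes 0 and 1, columns are the predicted classes 0 and 1), and unpacks it with ravel.

```python
import numpy as np

# Confusion matrix rebuilt from the counts quoted in the text.
con_mat = np.array([[48, 7],
                    [4, 84]])

# For a binary problem the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = con_mat.ravel()
print(tn, fp, fn, tp)
```

The same unpacking works on the con_mat returned by confusion_matrix(y_test,y_pred) for any binary target.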

The metric we looked at earlier was the accuracy:

accuracy=(TP+TN)/(TP+TN+FP+FN)=(84+48)/(84+48+7+4)=0.923

    from sklearn.metrics import accuracy_score
    acc_score=accuracy_score(y_test,y_pred)

In terms of a medical diagnostics test, the False Negatives are the biggest concern: we want to reduce the likelihood of telling a cancerous patient that they are healthy, as this will delay additional testing or treatment which could potentially be life-saving.

There are a number of other commonly used metrics.

The sensitivity (otherwise known as the recall):

sensitivity=TP/(TP+FN)=84/(84+4)=0.955

    from sklearn.metrics import recall_score
    rec_score=recall_score(y_test,y_pred)

The precision:

precision=TP/(TP+FP)=84/(84+7)=0.923

    from sklearn.metrics import precision_score
    prec_score=precision_score(y_test,y_pred)

The f1 score:

f1=2*(sensitivity*precision) / (sensitivity+precision)=2*(0.955*0.923)/(0.955+0.923)=0.939

    from sklearn.metrics import f1_score
    f_one_score=f1_score(y_test,y_pred)
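The four formulas can be checked by hand from the counts alone, without refitting the model. A minimal sketch in plain Python, using the TP, TN, FP and FN values quoted above:

```python
# Counts quoted in the text for the n_neighbors=1 model.
tp, tn, fp, fn = 84, 48, 7, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # also known as the recall
precision = tp / (tp + fp)
f1 = 2 * (sensitivity * precision) / (sensitivity + precision)

for name, value in [('accuracy', accuracy), ('sensitivity', sensitivity),
                    ('precision', precision), ('f1', f1)]:
    print(name, round(value, 3))
```

The values should agree with accuracy_score, recall_score, precision_score and f1_score applied to the same y_test and y_pred.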

All the metrics above are quite similar, in part because the number of samples in class 1 is of a similar size to (slightly larger than) the number of samples in class 0. In other medical datasets, particularly those where a disease is relatively rare, the proportion of samples in class 0 could be much higher, and this will create a substantial difference between these metrics.

Let's demonstrate this by multiplying the first row of the confusion matrix by 20.

Let's look at the following metrics:

accuracy=(TP+TN)/(TP+TN+FP+FN)=(84+960)/(84+960+140+4)=0.879

sensitivity=TP/(TP+FN)=84/(84+4)=0.955

precision=TP/(TP+FP)=84/(84+140)=0.375

f1=2*(sensitivity*precision) / (sensitivity+precision)=2*(0.955*0.375)/(0.955+0.375)=0.539

However let us now change the test data in the row to emulate a test that gave a much higher proportion of False Negatives and hence a lower occurrence of True Positives.

Now let's calculate the updated metrics:

accuracy=(TP+TN)/(TP+TN+FP+FN)=(30+960)/(30+960+140+58)=0.833

sensitivity=TP/(TP+FN)=30/(30+58)=0.341

precision=TP/(TP+FP)=30/(30+140)=0.176

f1=2*(sensitivity*precision) / (sensitivity+precision)=2*(0.341*0.176)/(0.341+0.176)=0.232

Notice that the accuracy changes only marginally because of the much higher prevalence of class 0. The other metrics are more sensitive to this change, in particular the sensitivity.
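The three scenarios can be compared side by side with a small helper. This is a sketch only: the function name binary_metrics is hypothetical, and the matrices are rebuilt from the counts quoted in the text, assuming rows are actual classes and columns are predicted classes.

```python
import numpy as np

def binary_metrics(con_mat):
    """Return accuracy, sensitivity, precision and f1 for a 2x2 confusion matrix."""
    tn, fp, fn, tp = con_mat.ravel()
    accuracy = (tp + tn) / con_mat.sum()
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    return accuracy, sensitivity, precision, f1

original = np.array([[48, 7],
                     [4, 84]])

# Emulate a rarer disease: 20 times as many actual Negatives (first row).
imbalanced = original.copy()
imbalanced[0, :] *= 20

# Emulate a weaker test on the same population: 58 False Negatives, 30 True Positives.
weaker = imbalanced.copy()
weaker[1, :] = [58, 30]

for name, cm in [('original', original), ('imbalanced', imbalanced), ('weaker', weaker)]:
    print(name, np.round(binary_metrics(cm), 3))
```

Small differences in the third decimal place relative to the hand calculations can arise because the hand calculations round the sensitivity and precision before combining them into f1.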

Returning to the original breast_cancer data, we will now use RandomizedSearchCV with 10 fold cross-validation to find the optimal 'n_neighbors' and 'weights'. However, this time we will optimize using the sensitivity metric by setting scoring='recall'.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1) from sklearn.neighbors import KNeighborsClassifier knn=KNeighborsClassifier() param_distributions={'n_neighbors':np.arange(start=1,stop=31,step=1), 'weights':['uniform','distance']} from sklearn.model_selection import RandomizedSearchCV modelknn=RandomizedSearchCV(knn,param_distributions, n_iter=10,scoring='recall', cv=10,random_state=1) modelknn.fit(X,y) cv_results=modelknn.cv_results_ best_recall_score=modelknn.best_score_ best_params=modelknn.best_params_ best_estimator=modelknn.best_estimator_

The best_recall_score=0.974 and the best_params are n_neighbors=25 and weights='uniform'. We can use these keyword input arguments to calculate a new confusion matrix. We will display it side by side with the original confusion matrix.
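The new confusion matrix can be computed in the same way as before. The sketch below assumes the same train-test split as the n_neighbors=1 example (random_state=1) and plugs in the best_params reported above; the matrix it prints is not quoted here because the original shows it only as a figure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

breast_cancer = load_breast_cancer()
X = breast_cancer['data']
y = breast_cancer['target']

# Same split as the earlier n_neighbors=1 example.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, train_size=0.75, random_state=1)

# Keyword input arguments taken from best_params.
knn = KNeighborsClassifier(n_neighbors=25, weights='uniform')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

rec_score = recall_score(y_test, y_pred)
con_mat = confusion_matrix(y_test, y_pred)
print(rec_score)
print(con_mat)
```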

Note that the number of False Negatives is down from 4 to 3 and the number of True Positives is up from 84 to 85. This is the overall trend we are looking for in a medical diagnostic, as it means doctors will now examine a patient who has breast cancer as opposed to telling her that she is fine. A consequence is that the number of False Positives has also gone up from 7 to 10, which unfortunately means doctors will re-examine patients who are otherwise healthy. However, in the context of medical diagnostics, the potential of catching and treating a malignant cancer in one patient, and so potentially saving her life, vastly outweighs the inconvenience of further testing or examination of 3 benign cases. A recall (sensitivity) of 0.974 (97.4 %) is high, but perhaps some of the additional models can raise it slightly, which could result in another life being saved.

### The Naive Bayes Classifier Estimator Class

We can try using the Naive Bayes Estimator Class using 10 fold cross validation.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.naive_bayes import GaussianNB gaussiannb=GaussianNB() gaussiannb.fit(X,y) from sklearn.model_selection import cross_val_score scores=cross_val_score(gaussiannb,X,y,cv=10,scoring='recall') rec_score_mean=np.mean(scores) score_std=np.std(scores)

The rec_score_mean=0.966, which gives worse performance than the KNeighborsClassifier estimator class.

### The Random Forest Classifier Model of Estimator Class

We can try the Random Forests Classifier Model of Estimator Class.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.ensemble import RandomForestClassifier randforest=RandomForestClassifier() randforest.fit(X,y) from sklearn.model_selection import cross_val_score scores=cross_val_score(randforest,X,y,cv=10,scoring='recall') rec_score_mean=np.mean(scores) score_std=np.std(scores)

The rec_score_mean=0.977, which gives slightly better performance than the KNeighborsClassifier estimator class. Let's use it to compute a confusion matrix:

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1) from sklearn.ensemble import RandomForestClassifier randforest=RandomForestClassifier() randforest.fit(X_train,y_train) y_pred=randforest.predict(X_test) from sklearn.metrics import recall_score rec_score=recall_score(y_test,y_pred) from sklearn.metrics import confusion_matrix con_mat=confusion_matrix(y_test,y_pred)

In this case the number of False Negatives is down to 2 and, as a bonus, the number of False Positives is also down to 5, performing better than the KNeighborsClassifier estimator class.

### The Support Vector Classifier Estimator Class

We can try the SVC Estimator Class.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.svm import SVC svc=SVC() cpowers=[10**c for c in np.arange(start=-5,stop=16,step=1, dtype=float)] gpowers=[10**g for g in np.arange(start=-15,stop=3,step=1, dtype=float)] param_distributions={'C':cpowers, 'gamma':gpowers} from sklearn.model_selection import RandomizedSearchCV modelsvc=RandomizedSearchCV(svc,param_distributions, n_iter=10,scoring='recall', cv=10,random_state=1) modelsvc.fit(X,y) cv_results=modelsvc.cv_results_ best_rec_score=modelsvc.best_score_ best_params=modelsvc.best_params_ best_estimator=modelsvc.best_estimator_

Here the best_rec_score is 1.0 and the best_params are C=1000000 and gamma=10 so let's try using these to compute a confusion matrix.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_breast_cancer breast_cancer=load_breast_cancer() X=breast_cancer['data'] y=breast_cancer['target'] from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=1) from sklearn.svm import SVC svc=SVC(C=1000000,gamma=10) svc.fit(X_train,y_train) y_pred=svc.predict(X_test) from sklearn.metrics import recall_score rec_score=recall_score(y_test,y_pred) from sklearn.metrics import confusion_matrix con_mat=confusion_matrix(y_test,y_pred)

Here we see that the results are too good to be true. In this case every sample has been predicted as Positive (True Positive or False Positive), so the recall is trivially 1.0 while no Negatives are identified at all. This makes a medical diagnostic following this model not useful.
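A quick way to spot this kind of degenerate model is to count how many test samples are predicted in each class and to look at the recall of each class separately. A minimal sketch using the same split and parameters as above (the per-class view with average=None is an addition, not part of the original code):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

breast_cancer = load_breast_cancer()
X = breast_cancer['data']
y = breast_cancer['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, train_size=0.75, random_state=1)

svc = SVC(C=1000000, gamma=10)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

# Number of test samples predicted as class 0 and as class 1.
pred_counts = np.bincount(y_pred, minlength=2)

# Recall of each class separately (index 0: class 0, index 1: class 1).
per_class_recall = recall_score(y_test, y_pred, average=None, labels=[0, 1])

print(pred_counts)
print(per_class_recall)
```

A model that places every sample in one class shows a zero in pred_counts and a recall of 0 for the class it never predicts, which a single recall figure for class 1 hides.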

### The Logistic Regression Predictor Class

We can now try the Logistic Regression Predictor Class:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    # Load the breast cancer bunch and separate the features from the target.
    breast_cancer=load_breast_cancer()
    X=breast_cancer['data']
    y=breast_cancer['target']

    logreg=LogisticRegression()
    # Candidate values of C: 10**0 up to 10**10.
    cpowers=[10**c for c in np.arange(start=0,stop=11,step=1,dtype=float)]
    param_distributions={'C':cpowers,'solver':['liblinear'],'random_state':[1]}

    # Randomized search with 10 fold cross-validation, scored on recall.
    logregmodel=RandomizedSearchCV(logreg,param_distributions,
                                   n_iter=10,scoring='recall',
                                   cv=10,random_state=1)
    logregmodel.fit(X,y)
    cv_results=logregmodel.cv_results_
    best_rec_score=logregmodel.best_score_
    best_params=logregmodel.best_params_

The best_rec_score=0.975, which gives comparable performance to the KNeighborsClassifier estimator class. However, in this case the RandomForestClassifier model of estimators seems to give the best results.

## The Diabetes Example Bunch (10 Features and Target Regression)

The diabetes dataset is a practice dataset taken from the medical diagnostics field. In this problem we have 10 features and a continuous target. Let's import the function from the datasets module and call the function to assign the bunch to an object name.

    from sklearn.datasets import load_diabetes
    diabetes=load_diabetes()

Let's open the diabetes bunch within the variable explorer:

Then have a look at the DESCR.

We get details about the 10 features. The target is continuous and is a measure of the progression of the disease. While it is possible to transform the target data into a number of discrete categories and treat this dataset as a classification problem, it is better to approach the data as a regression problem. We see that some of the features (age, sex) are uncontrollable; however, some of the other features (body mass index and average blood pressure) can be changed with diet and lifestyle. A doctor should advise a patient to adopt an optimal lifestyle in order to reduce the progression of the disease.

Let's now assign these to X and y respectively.

    from sklearn.datasets import load_diabetes
    diabetes=load_diabetes()
    X=diabetes['data']
    y=diabetes['target']

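Note that the data in this dataset has already been transformed. Each feature has been mean-centered and scaled by its standard deviation times the square root of the number of samples, so each column has a mean of 0 and a sum of squares of 1 (rather than a standard deviation of 1, as the Standard Scaler would give).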
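The scaling of the features can be checked directly from the data array. A minimal sketch that computes the mean, standard deviation and sum of squares of each of the 10 columns:

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X = diabetes['data']

# Per-feature summary statistics (one value per column).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
col_sum_sq = (X ** 2).sum(axis=0)

print(np.round(col_mean, 6))
print(np.round(col_std, 6))
print(np.round(col_sum_sq, 6))
```

Comparing col_std with col_sum_sq shows which of the two quantities has been normalized to 1 in this dataset.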

Let's convert the data into a dataframe, calculate the correlation matrix of the dataframe and plot it as a heatmap.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import seaborn as sns import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] names=list(diabetes['feature_names'])+['target'] alldata=np.concatenate([X,np.reshape(y,(-1,1))],axis=1) alldatadf=pd.DataFrame(alldata,columns=names) correlation=alldatadf.corr() fig1=plt.figure(1) ax1=fig1.add_subplot(111) sns.heatmap(correlation,annot=True,fmt='.0%',ax=ax1)

We are interested in the target (last column) and we see that the feature bmi (feature 2) has the strongest correlation at 0.59 (59 %). Let's plot this feature against the target, along with bp (feature 3).

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] fig2=plt.figure(2) ax2=fig2.add_subplot(111) ax2.scatter(X[:,2],y,s=5,color='b') ax2.set_xlabel('feature 2') ax2.set_ylabel('target') fig3=plt.figure(3) ax3=fig3.add_subplot(111) ax3.scatter(X[:,3],y,s=5,color='b') ax3.set_xlabel('feature 3') ax3.set_ylabel('target')

We can visualize a linear relationship between this data.

Let's also look at this data in 3D.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] fig4=plt.figure(4) ax4=fig4.add_subplot(111,projection='3d') ax4.scatter(X[:,2],X[:,3],y,color='b') ax4.set_xlabel('feature 2') ax4.set_ylabel('feature 3') ax4.set_zlabel('y')

We can picture a fit of these two features to the continuous target as a plane, but we cannot visualize the data in higher dimensions.

### The Digitize Function

We can use the digitize function from within the numpy library to create three categories: 0 (low progression), 1 (medium progression) and 2 (high progression). We can then view this data using a pair plot (we will just select four features: 'age', 'sex', 'bmi' and 'bp').
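How digitize assigns categories is easiest to see on a handful of values. In this minimal sketch the values in y_example are hypothetical, chosen to sit on and around the bin edges used in this section:

```python
import numpy as np

# Hypothetical target values on and around the bin edges.
y_example = np.array([50, 100, 150, 200, 346])

# Category i holds values with bins[i-1] <= value < bins[i];
# values below bins[0] are placed in category 0.
categories = np.digitize(y_example, bins=[100, 200, 350])
print(categories)  # [0 1 1 2 2]
```

A value of 350 or more would fall into a fourth category, 3, so the last bin edge needs to sit above the largest target value for exactly three categories to appear.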

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd import seaborn as sns from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] y2=np.digitize(y,bins=[100,200,350]) names=list(diabetes['feature_names'])+['target'] alldata=np.concatenate([X,np.reshape(y2,(-1,1))],axis=1) alldatadf=pd.DataFrame(alldata,columns=names) datadf2=alldatadf[['age','sex','bmi','bp','target']] plot1=sns.pairplot(datadf2,hue='target')

### The Principle Component Analysis Transformer Class

Although this is a regression problem, let's try reducing the number of dimensions so we can visualize our data against the target.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] from sklearn.decomposition import PCA pca=PCA(n_components=2) X2=pca.fit_transform(X) fig5=plt.figure(5) ax5=fig5.add_subplot(111,projection='3d') ax5.scatter(X2[:,0],X2[:,1],y,color='b') ax5.set_xlabel('feature 0') ax5.set_ylabel('feature 1') ax5.set_zlabel('y')

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] from sklearn.decomposition import PCA pca=PCA(n_components=1) x2=pca.fit_transform(X) fig6=plt.figure(6) ax6=fig6.add_subplot(111) ax6.scatter(x2,y,s=5,color='b') ax6.set_xlabel('feature 0') ax6.set_ylabel('target')

### The Isomap Transformer Class

Let's try using the Isomap Transformer Class to create a 3D plot and a 2D plot.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d  # registers the 3D projection on older matplotlib
    from sklearn.datasets import load_diabetes
    from sklearn.manifold import Isomap

    diabetes=load_diabetes()
    X=diabetes['data']
    y=diabetes['target']

    # Only two components are plotted against the target, so request two.
    iso=Isomap(n_components=2)
    X2=iso.fit_transform(X)

    fig7=plt.figure(7)
    ax7=fig7.add_subplot(111,projection='3d')
    ax7.scatter(X2[:,0],X2[:,1],y,s=5,color='b')
    ax7.set_xlabel('feature 0')
    ax7.set_ylabel('feature 1')
    ax7.set_zlabel('y')

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] from sklearn.manifold import Isomap iso=Isomap(n_components=1) x2=iso.fit_transform(X) fig8=plt.figure(8) ax8=fig8.add_subplot(111) ax8.scatter(x2,y,s=5,color='b',alpha=0.2) ax8.set_xlabel('feature 0') ax8.set_ylabel('y')

### The Linear Regression Estimator Class

We can use the LinearRegression estimator class to fit a straight line to the data and plot this straight line on each of the plots.

Let's do this for the plots of feature 2 against the target and feature 3 against the target.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] x2=X[:,2] x2=np.reshape(x2,(-1,1)) x3=X[:,3] x3=np.reshape(x3,(-1,1)) x2_ind=np.linspace(start=-0.15,stop=0.15,num=1000) x2_ind=np.reshape(x2_ind,(-1,1)) x3_ind=np.linspace(start=-0.15,stop=0.15,num=1000) x3_ind=np.reshape(x3_ind,(-1,1)) from sklearn.linear_model import LinearRegression reg2=LinearRegression() reg2.fit(x2,y) coef2=reg2.coef_[0] intercept2=reg2.intercept_ y2_pred=reg2.predict(x2_ind) from sklearn.linear_model import LinearRegression reg3=LinearRegression() reg3.fit(x3,y) coef3=reg3.coef_[0] intercept3=reg3.intercept_ y3_pred=reg3.predict(x3_ind) fig2=plt.figure(2) ax2=fig2.add_subplot(111) ax2.scatter(X[:,2],y,s=5,color='b') ax2.set_xlabel('feature 2') ax2.set_ylabel('target') ax2.plot(x2_ind,y2_pred,color='r') fig3=plt.figure(3) ax3=fig3.add_subplot(111) ax3.scatter(X[:,3],y,s=5,color='b') ax3.set_xlabel('feature 3') ax3.set_ylabel('target') ax3.plot(x3_ind,y3_pred,color='r')

The straight line for the target y with respect to the independent feature 2 gives a coef2=949 and intercept2=152.

The straight line for the target y with respect to the independent feature 3 gives a coef3=715 and intercept3=152.

We can also add a line to the 3D plot of feature 2, feature 3 and the target:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d  # registers the 3D projection on older matplotlib
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression

    diabetes=load_diabetes()
    X=diabetes['data']
    y=diabetes['target']
    # Use feature 2 (bmi) and feature 3 (bp) together.
    X2=X[:,[2,3]]

    # Points along the diagonal of the two features at which to draw the fit.
    x2_ind=np.reshape(np.linspace(start=-0.15,stop=0.15,num=1000),(-1,1))
    x3_ind=np.reshape(np.linspace(start=-0.15,stop=0.15,num=1000),(-1,1))
    X2_ind=np.concatenate((x2_ind,x3_ind),axis=1)

    reg=LinearRegression()
    reg.fit(X2,y)
    coef_x2=reg.coef_[0]
    coef_x3=reg.coef_[1]
    intercept=reg.intercept_
    y2_pred=reg.predict(X2_ind)

    fig4=plt.figure(4)
    ax4=fig4.add_subplot(111,projection='3d')
    ax4.scatter3D(X[:,2],X[:,3],y,s=5,color='b',alpha=0.2)
    ax4.scatter3D(x2_ind[:,0],x3_ind[:,0],y2_pred,s=2,color='r')
    ax4.set_xlabel('feature 2')
    ax4.set_ylabel('feature 3')
    ax4.set_zlabel('y')

The fit for the target y with respect to the independent features 2 and 3 gives coef_x2=790, coef_x3=402 and intercept=152.

We can also add a line to the plot which used PCA with n_components=2 and the target.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import axes3d  # registers the 3D projection on older matplotlib
    from sklearn.datasets import load_diabetes
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    diabetes=load_diabetes()
    X=diabetes['data']
    y=diabetes['target']

    # Reduce the 10 features to 2 principal components.
    pca=PCA(n_components=2)
    X2=pca.fit_transform(X)

    # Points along the diagonal of the two components at which to draw the fit.
    x0_ind=np.reshape(np.linspace(start=-0.15,stop=0.15,num=1000),(-1,1))
    x1_ind=np.reshape(np.linspace(start=-0.15,stop=0.15,num=1000),(-1,1))
    X2_ind=np.concatenate((x0_ind,x1_ind),axis=1)

    reg=LinearRegression()
    reg.fit(X2,y)
    coef_x0=reg.coef_[0]
    coef_x1=reg.coef_[1]
    intercept=reg.intercept_
    y2_pred=reg.predict(X2_ind)

    fig5=plt.figure(5)
    ax5=fig5.add_subplot(111,projection='3d')
    # Plot the two principal components (not the raw features) against the target.
    ax5.scatter3D(X2[:,0],X2[:,1],y,s=5,color='b',alpha=0.2)
    ax5.scatter3D(x0_ind[:,0],x1_ind[:,0],y2_pred,s=2,color='r')
    ax5.set_xlabel('feature 0')
    ax5.set_ylabel('feature 1')
    ax5.set_zlabel('y')

The fit for the target y with respect to the principal components 0 and 1 gives coef_x0=448, coef_x1=-256 and intercept=152.

#### Train-Test Split Function

We have seen the use of the correlation coefficient, which gauges how correlated a single feature is with the target (this can be thought of visually as a measure of how closely the data points follow the line).

We can then import the functions mean_squared_error and r2_score and apply these by comparing the predicted y_pred to the known y values.

We want to reduce the mean_squared_error as much as possible.

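The r2 score is at most 1 (the data perfectly fits the line). A score of 0 means the fit does no better than always predicting the mean of y, and the score can be negative on test data if the fit does worse than that.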
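Both metrics are simple enough to compute by hand, which helps when interpreting their values. In this minimal sketch the helper name mse_and_r2 and the small arrays are hypothetical; the formulas are the standard definitions of the mean squared error and the coefficient of determination.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return the mean squared error and the r2 score of a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the fit
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return ss_res / y_true.size, 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]

# Predictions close to the known values.
mse_close, r2_close = mse_and_r2(y_true, [2.5, 5.5, 7, 8])
# Always predicting the mean of y_true.
mse_mean, r2_mean = mse_and_r2(y_true, [6, 6, 6, 6])

print(mse_close, r2_close)
print(mse_mean, r2_mean)
```

The mse is in the squared units of the target, so it is only comparable between fits of the same target, whereas r2 is unitless.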

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] x2=X[:,2] x2=np.reshape(x2,(-1,1)) from sklearn.model_selection import train_test_split x2_train,x2_test,y2_train,y2_test=train_test_split(x2,y, test_size=0.25,train_size=0.75, random_state=4) from sklearn.linear_model import LinearRegression reg=LinearRegression() reg.fit(x2_train,y2_train) y2_pred=reg.predict(x2_test) coef=reg.coef_[0] intercept=reg.intercept_ from sklearn.metrics import mean_squared_error mse=mean_squared_error(y2_test,y2_pred) from sklearn.metrics import r2_score r2=r2_score(y2_test,y2_pred)

The coef=899 and intercept=151. The straight line fit has a mse=3062 and r2=0.442. This r2 is substantially above 0, and bmi is a lifestyle parameter that can be changed with diet and exercise, so doctors will advise patients to keep their bmi within the healthy range to slow the progression of diabetes.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] x3=X[:,3] x3=np.reshape(x3,(-1,1)) from sklearn.model_selection import train_test_split x3_train,x3_test,y3_train,y3_test=train_test_split(x3,y, test_size=0.25,train_size=0.75, random_state=4) from sklearn.linear_model import LinearRegression reg=LinearRegression() reg.fit(x3_train,y3_train) y3_pred=reg.predict(x3_test) coef=reg.coef_[0] intercept=reg.intercept_ from sklearn.metrics import mean_squared_error mse=mean_squared_error(y3_test,y3_pred) from sklearn.metrics import r2_score r2=r2_score(y3_test,y3_pred)

The coef=780 and intercept=152. The straight line fit has a mse=5189 and r2=0.054. As expected the r2 for the x3 (bp) feature is much lower than in the case of the r2 for the x2 (bmi) feature as the correlation between the feature and the target is lower and the data does not follow the straight line as closely. The r2 is very close to 0 and thus suggests that this feature may not be very useful on its own to predict the progression of this disease.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] X2=X[:,[2,3]] from sklearn.model_selection import train_test_split X2_train,X2_test,y2_train,y2_test=train_test_split(X2,y, test_size=0.25,train_size=0.75, random_state=4) from sklearn.linear_model import LinearRegression reg=LinearRegression() reg.fit(X2_train,y2_train) y2_pred=reg.predict(X2_test) coef_x2=reg.coef_[0] coef_x3=reg.coef_[1] intercept=reg.intercept_ from sklearn.metrics import mean_squared_error mse=mean_squared_error(y2_test,y2_pred) from sklearn.metrics import r2_score r2=r2_score(y2_test,y2_pred)

The coef_x2=700, coef_x3=501 and intercept=152. The fit has a mse=3387 and r2=0.382.

Although we can't imagine the multi-dimensional space conceptually, we can mathematically try to use all the features.

import numpy as np import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import axes3d import pandas as pd from sklearn.datasets import load_diabetes diabetes=load_diabetes() X=diabetes['data'] y=diabetes['target'] from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.25,train_size=0.75, random_state=4) from