K-Nearest Neighbor (KNN) Algorithm in Machine Learning

Introduction to the KNN Algorithm in Machine Learning

In this post, we will talk about how the KNN algorithm works in machine learning and the basic mathematical idea behind it.

We will discuss the main concept behind this algorithm, and you will be amazed to see how the basic mathematical notion of distance becomes a powerful concept in machine learning.

The KNN algorithm is no rocket science: it is simple and follows a basic mathematical notion.

For example, say you purchased a bookshelf and decided to arrange your books according to their genre: all the literature in one place, all the science fiction together, and so on...

This simple task hides a keen observation. Have you ever noticed how you categorize your books?

It is simple: you pick up a book, read its title or its author's name, and you know that this author writes science fiction. So, you put that book together with the other science fiction books.

In mathematics, we say that the book belongs to science fiction because it has similar properties when compared with other science fiction books.

What we observed is that things with similar or the same properties come under one roof, or lie near to each other. The same notion applies to the KNN algorithm.

When we deal with a classification problem in machine learning, we observe that data-points with similar properties lie near to each other.

When we use the word "near", it is important to understand the notion of "near" in mathematics. When do we say that an object or point is near? Simple: we find the distance between the points, right?
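For concreteness, the most common choice is the Euclidean distance: the straight-line distance between two points. A quick illustration in numpy (the points u and v are made up for this example):

import numpy as np

u = np.array([1, 2])
v = np.array([4, 6])

# Euclidean distance: sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5.0
print(np.linalg.norm(u - v))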

In machine learning, we work with data-points. In our earlier posts, we discussed how linear algebra plays an important role in machine learning. So, in machine learning, the vectors or data-points with similar or nearly the same values lie near to each other!

Now, keeping that observation in mind, we can predict the category of a new data-point by observing its neighboring data-points. This concept is called K-nearest neighbors, where we observe the K neighbors of the data-point to be predicted.

Let's understand this algorithm with the IRIS data-set. Download the dataset from the given link: https://github.com/srujaneResearch/creationcodes/tree/master/datasets

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv(r"Iris_Data.csv")

print(data.head(3))
print("The shape of the dataset: ", data.shape)

Output
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    The shape of the dataset:  (150, 5)

The above dataset contains the dimensions of a flower, namely sepal_length, sepal_width, petal_length, and petal_width, and using these values it determines whether the flower is Iris-setosa, Iris-versicolor, or Iris-virginica.

Let's find out how many different categories we have in this data-set.

data['species'].value_counts()
Output
    Iris-setosa        50
    Iris-virginica     50
    Iris-versicolor    50
    Name: species, dtype: int64

For our understanding, let's divide the data-set into its respective categories, and then we will plot each category to observe the data-points.

dataVersicolor = data[data['species']=='Iris-versicolor']
dataVirginica = data[data['species']=='Iris-virginica']
dataSetosa = data[data['species']=='Iris-setosa']

I will plot the features sepal_length and sepal_width of each category. Observe that flowers of the same category share similar properties and lie near to each other.

plt.scatter(dataSetosa.iloc[:,0],dataSetosa.iloc[:,1],color='b',label='Iris-Setosa')
plt.scatter(dataVersicolor.iloc[:,0], dataVersicolor.iloc[:,1],color='r',label='Iris-Versicolor')
plt.scatter(dataVirginica.iloc[:,0],dataVirginica.iloc[:,1],color='k', label='Iris-Virginica')
plt.legend(loc='upper right')
plt.show()

From the above figure, observe that the Iris-Setosa type is aggregated in one place, the Iris-Versicolor points lie near to each other, and similarly for Iris-Virginica. This happens because they share similar properties: the dimensions of their sepals and petals are nearly equal.

Now, if a new data-point is to be predicted, then we can observe its nearest neighbors and tell whether the point belongs to Iris-setosa, Iris-versicolor, or Iris-virginica.

But how many nearest neighbors should we consider? Look at the following figure.

Suppose a data-point with sepal_width = 2.0 and sepal_length = 4.5 is to be predicted, and suppose the true answer is Iris-Versicolor. Let's figure out our approach: we are going to look at the nearest neighbors and tell the category.

I will again plot all the data-points along with my new point [4.5, 2.0] in the below graph. The new point is indicated with the "x" marker.

plt.scatter(dataSetosa.iloc[:,0],dataSetosa.iloc[:,1],color='b',label='Iris-Setosa')
plt.scatter(dataVersicolor.iloc[:,0], dataVersicolor.iloc[:,1],color='r',label='Iris-Versicolor')
plt.scatter(dataVirginica.iloc[:,0],dataVirginica.iloc[:,1],color='k', label='Iris-Virginica')
plt.scatter(4.5,2.0,marker='x',label='New point')  # point to be predicted is denoted by x
plt.legend(loc='upper right')
plt.show()

If you observe the figure, "x" is the point we have to predict. 

In order to predict it, I have to calculate the distance of the point "x" from all the data-points in the above data-set.

After calculating the distance from all the data-points, I will pick the point with the minimum distance from our "x" and find out the category of that point.

Now, the question is: should I consider only a single neighbor, or can considering more neighbors increase the probability of a correct prediction?

Choosing the number of nearest points is what we call determining the value of K, or the number of nearest neighbors.

It is very important that we choose this number carefully, because our prediction depends upon it.

For example, if I take one nearest neighbor, then it could be the blue dot (Iris-Setosa) or the red dot (Iris-Versicolor), and you can observe that the blue dot is the nearest among them! So your algorithm may give you the output Iris-Setosa, which is wrong!

What can we do to solve this problem?

We can consider more nearest neighbors.

Suppose I take 3 nearest neighbors; then we get two red dots and one blue dot. This increases the probability of the red dots, and the algorithm outputs Iris-Versicolor.

If I take 5 nearest neighbors, then I get three red dots, one blue dot, and one black dot.

So, in the KNN algorithm, for a successful prediction you need to choose the value of K very carefully. A common rule of thumb is to pick an odd K, which makes ties less likely.
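To make the effect of K concrete, here is a toy illustration. The list of neighbor labels below is hypothetical, mirroring the dots described above (nearest first):

from collections import Counter

# Hypothetical labels of the 5 nearest neighbors, closest first
neighbors_sorted = ['setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica']

for k in (1, 3, 5):
    vote = Counter(neighbors_sorted[:k]).most_common(1)[0][0]
    print(k, vote)

# K=1 -> setosa (the wrong answer, as discussed above)
# K=3 -> versicolor
# K=5 -> versicolor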

Let's jump into coding the above-discussed theory in Python.

First, we need to convert the categorical values of the species column into a numeric data type. For this, we will use the replace function.

data.replace({"species":{"Iris-setosa":0,"Iris-versicolor":1,"Iris-virginica":2}},inplace=True)
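A quick optional check to confirm the mapping took effect:

print(data['species'].value_counts())  # 0, 1, and 2 should each appear 50 times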

Now, let's convert our data-frame into a numpy array and split it into inputs and outputs.

array = data.values

X = array[:,:4]
Y = array[:,4]

Let's observe the X variable.

X[:5]
Output
    array([[5.1, 3.5, 1.4, 0.2],
           [4.9, 3. , 1.4, 0.2],
           [4.7, 3.2, 1.3, 0.2],
           [4.6, 3.1, 1.5, 0.2],
           [5. , 3.6, 1.4, 0.2]])

Let's take a minute to understand this input matrix.

The shape of our input matrix is

X.shape
Output
    (150, 4)


The shape we got is (150, 4). Now, what does this mean?
It means that we have 150 different data-points, and each data-point has 4 coordinates.

Now, what are these 4 coordinates?

These four coordinates are nothing but our flower properties:
sepal_length,
sepal_width,
petal_length, and petal_width.

Now, with the above considerations, I can treat every row of our input matrix as a vector!

Why did we do this? Because it will help us to calculate distances easily.

Suppose a new flower comes with dimensions [5, 2, 3, 4].
Now, if I treat this as a vector v1 = [5, 2, 3, 4], I can easily calculate its distance from every data-point in the data-set.

For example, let's take a data-point from the above data-set, v2 = [5.1, 3.5, 1.4, 0.2]. Then distance = ||v1 - v2||, the Euclidean norm of the difference. Now that we have covered the linear algebra part of this algorithm, let's move forward and create a function that can calculate the distance between two vectors.
import numpy.linalg as lg

def distance(v1, v2):
    return lg.norm(v1 - v2)  # formula: ||v1 - v2||
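A quick sanity check with the two vectors from the example above (the expected value is my own calculation):

v1 = np.array([5, 2, 3, 4])
v2 = np.array([5.1, 3.5, 1.4, 0.2])

print(distance(v1, v2))  # sqrt(0.01 + 2.25 + 2.56 + 14.44) = sqrt(19.26) ≈ 4.39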

Now, let's create a function that builds the distance matrix: a matrix containing the distance between the new point and each and every data-point in our data-set.
def distanceMatrix(newpoint, Dataset):
    R, C = Dataset.shape  # the shape of our dataset; in our case R = 150 and C = 4
    distance_matrix = np.zeros((R, 1))  # a matrix of shape (150, 1): one distance per data-point

    for i in range(R):
        distance_matrix[i] = distance(newpoint, Dataset[i])

    return distance_matrix
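As an aside, the same distances can also be computed without an explicit loop, using numpy broadcasting. This alternative function (my own addition; nothing below depends on it) does the same job in one line:

def distanceMatrixVectorized(newpoint, Dataset):
    # Subtracting a (1, 4) point from a (150, 4) dataset broadcasts row-wise;
    # norm(..., axis=1) then collapses each row to a single distance.
    return lg.norm(Dataset - newpoint, axis=1).reshape(-1, 1)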

Now, for the sake of understanding, let's take a newPoint = [[5.1, 3.5, 1.4, 0.2]].
newPoint = np.array([[5.1, 3.5, 1.4, 0.2]])

dist = distanceMatrix(newPoint, X)

dist.shape
    (150, 1)
Observe the dist variable.
It contains the distances from our new point to all the points in our data-set.

Now, we have to select the value of K, i.e., the number of nearest neighbors we want to look at.

For that, we need to sort this array so that we can get the K minimum distances.

We will use numpy's argsort function, which takes an array and returns the index values in sorted order.
R, C = dist.shape
dist = dist.reshape((R,))  # reshape dist from (150, 1) to (150,); these two shapes have different meanings, and argsort is easiest to use with a 1D array

K = 3
K_nearest = np.argsort(dist)[:K]  # keep the indices of only the 3 smallest distances, using numpy slicing

K_nearest
    array([ 0, 17,  4], dtype=int64)
We got the index positions of the data-points that are nearest to our new point. Now let's see their categories. Our output variable Y contains the categories, so we will pass these index values to Y and find out the categories at those indexes.
Y[K_nearest]
    array([0., 0., 0.])
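To turn these neighbor labels into a single prediction programmatically, we take a majority vote. A minimal sketch using Python's collections.Counter (this predict helper is my own addition, not part of the original walkthrough):

from collections import Counter

def predict(neighbor_labels):
    # The most common label among the K neighbors wins.
    return Counter(neighbor_labels).most_common(1)[0][0]

print(predict(Y[K_nearest]))  # 0.0, i.e., Iris-setosa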
According to this, the majority category near our new data-point is 0, i.e., Iris-setosa, which means the points nearest to our new data-point are Iris-Setosa. Let's plot this and see whether our prediction is correct or not.
plt.scatter(dataSetosa.iloc[:,0],dataSetosa.iloc[:,1],color='b',label='Iris-Setosa')
plt.scatter(dataVersicolor.iloc[:,0], dataVersicolor.iloc[:,1],color='r',label='Iris-Versicolor')
plt.scatter(dataVirginica.iloc[:,0],dataVirginica.iloc[:,1],color='k', label='Iris-Virginica')
plt.scatter(5.1,3.5,marker='o',color='g',label='New Point')  # the predicted point is the green dot
plt.legend(loc='upper right')
plt.show()
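If you would rather not hand-roll the algorithm, scikit-learn ships a ready-made implementation. A short sketch (assuming scikit-learn is installed) that should reproduce our prediction:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3, as above
knn.fit(X, Y)

print(knn.predict(newPoint))  # expected: [0.], i.e., Iris-setosa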
That's it. A simple concept of calculating distance becomes a full algorithm in machine learning. The maths is neat and easy. Note that I have kept this concept as simple as possible to help you understand it. Once you understand the basic idea, you can go on to edit the code and build a better KNN model.
