Solving MNIST dataset classification using k-Nearest Neighbor
In the last part of this article, we used a neural network with a single hidden layer to classify the MNIST dataset, and the result was about 92.25% accuracy on the 10,000 test images. In this article we are going to use the k-nearest neighbor algorithm without any preprocessing. If you want to know more about kNN, you can refer to its Wikipedia article.
The kNN algorithm has always been interesting to me since it has some unique characteristics. It involves all of the training data in the decision about each test sample. You can also use it supervised (classification problems) or unsupervised (clustering problems), and you can use various types of metrics for measuring the distance between instances. Some common metrics are:
- Euclidean distance: the straight-line distance between two points in Euclidean space.
- Manhattan distance: the sum of the absolute differences of the two points' Cartesian coordinates.
- Mahalanobis distance: the distance between a point P and a distribution D.
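As a quick sketch, all three metrics can be computed with NumPy and SciPy (this is just an illustration with made-up points, not part of the article's code; the sample `data` used to build the distribution for the Mahalanobis case is an assumption):

```python
import numpy as np
from scipy.spatial.distance import cityblock, mahalanobis

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance in Euclidean space
euclidean = np.linalg.norm(a - b)   # sqrt(3^2 + 4^2) = 5.0

# Manhattan distance: sum of absolute coordinate differences
manhattan = cityblock(a, b)         # |3| + |4| = 7.0

# Mahalanobis distance of point a from a distribution D, which is
# characterized here by the mean and inverse covariance of sample data
data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 2.0]])
inv_cov = np.linalg.inv(np.cov(data.T))
mahal = mahalanobis(a, data.mean(axis=0), inv_cov)
```

Unlike the first two, the Mahalanobis distance is scale-aware: it shrinks along directions in which the distribution has high variance.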
I have implemented a simple kNN algorithm using Python 3.5 (from the Anaconda package, which has NumPy, SciPy and Matplotlib installed by default); you can download the code from here. How does it work? The code reads all the training images into memory (60,000 training images) and all the test images as well (10,000 test cases). Then, for each individual test case, it computes the distance between that test sample and all the training images. You can specify the k parameter, which is 1 by default. After computing all distances, the function returns the k nearest labels from the training set. I have run the program with k=1, and this is the result:
    Reading magic number...OK
    Reading magic label number...OK
    Number of training image : 60000
    k value of k-nearest neighbor algorithm: k=1 (1-nearest neighbor)
    Images are 28 by 28 pixels
    Reading All images into memory....................................................DONE!!!
    Reading magic number...OK
    Reading magic label number...OK
    Number of test image : 10000
    Images are 28 by 28 pixels
    Reading All test images into memory.........DONE!!!
    analyzing 10,000 data...finished
    number of error is 309.
    error image index: 115, 195, 241, 268, 300, 320, 321, 341, 358, 381, 412, 444, 445, 447, 464, 479 ....
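The procedure described above, comparing each test image against all 60,000 training images and taking a majority vote among the k nearest labels, can be sketched roughly like this (a simplified NumPy version, not the downloadable code; the function name and array shapes are my assumptions):

```python
import numpy as np
from collections import Counter

def knn_predict(train_images, train_labels, test_image, k=1):
    """Classify one test image against the full training set.

    train_images: (n, 784) float array of flattened 28x28 images
    train_labels: (n,) integer labels
    test_image:   (784,) flattened test image
    """
    # Euclidean distance from the test image to every training image
    dists = np.linalg.norm(train_images - test_image, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest labels (with k=1 this just
    # copies the label of the single closest training image)
    return Counter(train_labels[nearest]).most_common(1)[0][0]
```

Note that there is no training phase at all: the whole cost is paid at prediction time, which is why classifying all 10,000 test images against 60,000 training images takes a while.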
Some samples (309, to be exact) are classified incorrectly (misclassified), which means a 3.09% error rate and 96.91% accuracy, better than the neural network results (our neural network implementation had 92.25% accuracy). Some of these misclassified samples are shown below (hover the mouse over each image to see the real and predicted values):
As you can see, some of these samples are really hard to read even for a human.
Still, we are not going to add a preprocessing step, since the main goal with this dataset is to predict the image labels using machine learning techniques only.
**But the results are considerably better than those of the neural network model.**