k-means clustering

k-means is an unsupervised algorithm that aims to cluster samples into k groups. In this tutorial we are going to write a python code to implements k-means clustering algorithm and solve the problem. There are lots of variants for this algorithm, but here we present the most simple and original one. The algorithm starts with choosing k samples as initial centers. Then for each sample, computes the distance between sample and k centers. Assign each sample to the nearest center. after assigning all samples, we have to compute new centers and run algorithm again. we will do this until no sample change it's cluster or after a certain number of iteration. but how to compute the distance between samples and clusters? Well, we can use any distance measures like manhatan distance or euclidean distance or whatever we like.

The data set we are using can be download from HERE.

Explanation of dataset

Sample data is like table below:

# Feature 1(X) Feature 2(Y)
1 713661.6583 3983664.9982
2 732635.2434 4023740.9569
3 720624.7737 4027533.2538
. . .

As you can see, each record has two features called X and Y. The number of sample data in our data set is 20,000. Now we are going to plot our data using python matplotlib library as below:

	#python 3.5 code for plotting dataset 
	from matplotlib import pyplot as plt
	import numpy as np
	my_data = np.genfromtxt('data.txt');
	plt.plot(my_data[:,0],my_data[:,1],'b*')
	plt.show()

		

And the result is:

As you can see in the image, the natural way to cluster this data is using five clusters.

Here is the python code for k-means clustering.


'''
Note that this is an Unsupervised algorithm.
This algorithm uses Forgy method for initialization since it is the preferable
    method for standard kmeans.
There is no guarantee tha kemans algorithm converge to global optimum and
    it highly depends on initialization. So you have to run it several times and
choose the best result.
It chooses random k samples as centers and continue to find the optimal center.
Stop condition :
    1. number of iteration >= max_number_of_iteration
    2. no change in clusters
Input parameters:
    samples : this is the dataset you want to cluster(numpy array)
    k : number of cluster
    centers : user provided centers(default is None) and it must be a list of list
    for example: centers = [
                 [feat1,feat2,feat3],
                 [feat1,feat2,feat3],
                 [feat1,feat2,feat3],
                 [feat1,feat2,feat3]
                ]
    max_number_of_iteration : maximum number of iteration to stop(default is 1000).
'''
    
import numpy as np
class kmeans():
    def __init__(self,samples,k=2,centers=None,max_number_of_iteration=1000):
        self.max_number_of_iteration = max_number_of_iteration;
        self.number_of_samples = len(samples);
        self.cluster_number = np.ones((self.number_of_samples,)) * -1
        self.k = k;
        self.samples = samples
        
        if(centers == None):
            self.centers = list();
            #select k random centers
            temp_list=list()
            while(len(self.centers)!=k):
                center_temp = np.random.randint(0,self.number_of_samples);
                if (center_temp not in temp_list):
                    self.centers.append(self.samples[center_temp]);
                    temp_list.append(center_temp)
                #end if
            #end while
        else:
            if isinstance(centers,list):
                if(len(centers) == k):
                    self.centers = centers
                else:
                   print("ERROR : centers length must be equal to number of clusters(k)");
                   return;
            else:
                print("ERROR : centers must be list");
                return;
        #end if
        
    def compute_distance(self,instance1,instance2):
        #euclidean distance
        #you can replace this method with any metrics you like.
        result = 0;
        if(len(instance1)!=len(instance2)):
            print("ERROR: instance1 and instance2 must be the same size.")
            return
        #return (np.sqrt(sum((abs(instance1-instance2)))))
        return (np.sqrt(sum((instance1-instance2)**2)))
        
    def run(self):
        changed = np.ones((self.number_of_samples,));
        iteration = 0;
        distance_array = np.zeros((self.number_of_samples,self.k))
        print('Starting iteration....')
        #YOU CAN CHANGE THE STOP CONDITION HERE BY REMOVING ONE OF THESE CLAUSES
        while(iteration < self.max_number_of_iteration and np.any(changed)==1):
            for i in range(self.number_of_samples):
                for j in range(self.k):
                    distance_array[i,j] = self.compute_distance(self.samples[i],
                                                                self.centers[j]);
                #enf for j
                #print(distance_array[i]);
                if(self.cluster_number[i]==(np.argmin(distance_array[i]))):
                   changed[i] = 0
                else:
                   changed[i] = 1;
                   self.cluster_number[i] = np.argmin(distance_array[i]);
            #end for i
            iteration +=1
            #now calculate new centers
            for k in range(self.k):
                if(len(self.samples[self.cluster_number==k])!=0):
                    self.centers[k] = sum(self.samples[self.cluster_number==k])/(len(
                        self.samples[self.cluster_number==k]))
                #print(self.centers[k]);
            print("iter #"+str(iteration))
        print('Done...');
        print("Number of data in each cluster is : ")
        for i in range(self.k):
            num = len(self.samples[self.cluster_number==i])
            print("Cluster #"+ str(i) + " has " + str(num) + " member.")
        return self.cluster_number;
        
if __name__=="__main__":
    import matplotlib as mpl
    from matplotlib import pyplot as plt
    data = np.genfromtxt("data.txt");
    k=4
    #centers = [[0.5,0.5],[0.5,0.5],[0.5,0.5],[0.5,0.5]]
    kmeans_test = kmeans(data,k);
    kmeans_test.run()
    
    kmeans_test.centers = np.array(kmeans_test.centers)
    plt.plot(kmeans_test.centers[:,0],kmeans_test.centers[:,1],'b*',markersize=13)
    for i in range(k):
        class1 =  kmeans_test.samples[kmeans_test.cluster_number==i]
        plt.plot(class1[:,0],class1[:,1],'r.')
    plt.show()
 
		     
You can download the code from here