k-means is an unsupervised algorithm that aims to cluster samples into k groups. In this tutorial we are going to write a python code to implements k-means clustering algorithm and solve the problem. There are lots of variants for this algorithm, but here we present the most simple and original one. The algorithm starts with choosing k samples as initial centers. Then for each sample, computes the distance between sample and k centers. Assign each sample to the nearest center. after assigning all samples, we have to compute new centers and run algorithm again. we will do this until no sample change it's cluster or after a certain number of iteration. but how to compute the distance between samples and clusters? Well, we can use any distance measures like manhatan distance or euclidean distance or whatever we like.
The data set we are using can be download from HERE.
Explanation of dataset
Sample data is like table below:
|#||Feature 1(X)||Feature 2(Y)|
As you can see, each record has two features called X and Y. The number of sample data in our data set is 20,000. Now we are going to plot our data using python matplotlib library as below:
#python 3.5 code for plotting dataset from matplotlib import pyplot as plt import numpy as np my_data = np.genfromtxt('data.txt'); plt.plot(my_data[:,0],my_data[:,1],'b*') plt.show()
And the result is:
Here is the python code for k-means clustering.
You can download the code from here
''' Note that this is an Unsupervised algorithm. This algorithm uses Forgy method for initialization since it is the preferable method for standard kmeans. There is no guarantee tha kemans algorithm converge to global optimum and it highly depends on initialization. So you have to run it several times and choose the best result. It chooses random k samples as centers and continue to find the optimal center. Stop condition : 1. number of iteration >= max_number_of_iteration 2. no change in clusters Input parameters: samples : this is the dataset you want to cluster(numpy array) k : number of cluster centers : user provided centers(default is None) and it must be a list of list for example: centers = [ [feat1,feat2,feat3], [feat1,feat2,feat3], [feat1,feat2,feat3], [feat1,feat2,feat3] ] max_number_of_iteration : maximum number of iteration to stop(default is 1000). ''' import numpy as np class kmeans(): def __init__(self,samples,k=2,centers=None,max_number_of_iteration=1000): self.max_number_of_iteration = max_number_of_iteration; self.number_of_samples = len(samples); self.cluster_number = np.ones((self.number_of_samples,)) * -1 self.k = k; self.samples = samples if(centers == None): self.centers = list(); #select k random centers temp_list=list() while(len(self.centers)!=k): center_temp = np.random.randint(0,self.number_of_samples); if (center_temp not in temp_list): self.centers.append(self.samples[center_temp]); temp_list.append(center_temp) #end if #end while else: if isinstance(centers,list): if(len(centers) == k): self.centers = centers else: print("ERROR : centers length must be equal to number of clusters(k)"); return; else: print("ERROR : centers must be list"); return; #end if def compute_distance(self,instance1,instance2): #euclidean distance #you can replace this method with any metrics you like. result = 0; if(len(instance1)!=len(instance2)): print("ERROR: instance1 and instance2 must be the same size.") return #return (np.sqrt(sum((abs(instance1-instance2))))) return (np.sqrt(sum((instance1-instance2)**2))) def run(self): changed = np.ones((self.number_of_samples,)); iteration = 0; distance_array = np.zeros((self.number_of_samples,self.k)) print('Starting iteration....') #YOU CAN CHANGE THE STOP CONDITION HERE BY REMOVING ONE OF THESE CLAUSES while(iteration < self.max_number_of_iteration and np.any(changed)==1): for i in range(self.number_of_samples): for j in range(self.k): distance_array[i,j] = self.compute_distance(self.samples[i], self.centers[j]); #enf for j #print(distance_array[i]); if(self.cluster_number[i]==(np.argmin(distance_array[i]))): changed[i] = 0 else: changed[i] = 1; self.cluster_number[i] = np.argmin(distance_array[i]); #end for i iteration +=1 #now calculate new centers for k in range(self.k): if(len(self.samples[self.cluster_number==k])!=0): self.centers[k] = sum(self.samples[self.cluster_number==k])/(len( self.samples[self.cluster_number==k])) #print(self.centers[k]); print("iter #"+str(iteration)) print('Done...'); print("Number of data in each cluster is : ") for i in range(self.k): num = len(self.samples[self.cluster_number==i]) print("Cluster #"+ str(i) + " has " + str(num) + " member.") return self.cluster_number; if __name__=="__main__": import matplotlib as mpl from matplotlib import pyplot as plt data = np.genfromtxt("data.txt"); k=4 #centers = [[0.5,0.5],[0.5,0.5],[0.5,0.5],[0.5,0.5]] kmeans_test = kmeans(data,k); kmeans_test.run() kmeans_test.centers = np.array(kmeans_test.centers) plt.plot(kmeans_test.centers[:,0],kmeans_test.centers[:,1],'b*',markersize=13) for i in range(k): class1 = kmeans_test.samples[kmeans_test.cluster_number==i] plt.plot(class1[:,0],class1[:,1],'r.') plt.show()