## k-means clustering

k-means is an unsupervised algorithm that partitions samples into k groups. In this tutorial we are going to write Python code that implements the k-means clustering algorithm. There are many variants of this algorithm, but here we present the simplest, original one. The algorithm starts by choosing k samples as initial centers. Then, for each sample, it computes the distance between that sample and each of the k centers and assigns the sample to the nearest one. After all samples have been assigned, we compute new centers and run the algorithm again, repeating until no sample changes its cluster or a certain number of iterations is reached. But how do we compute the distance between a sample and a center? We can use any distance measure we like, such as Manhattan distance or Euclidean distance.
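As an illustration, the two distance measures mentioned above can each be written in one line of NumPy (a sketch; the function names `euclidean` and `manhattan` are our own):

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance: sqrt of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # city-block distance: sum of absolute differences
    return np.sum(np.abs(a - b))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
euclidean(a, b)  # 5.0
manhattan(a, b)  # 7.0
```

Any metric with this signature can be dropped into the assignment step without changing the rest of the algorithm.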

The data set we are using can be downloaded from HERE.

#### Explanation of dataset

Sample data is like table below:

| # | Feature 1 (X) | Feature 2 (Y) |
|---|---------------|---------------|
| 1 | 713661.6583 | 3983664.9982 |
| 2 | 732635.2434 | 4023740.9569 |
| 3 | 720624.7737 | 4027533.2538 |
| . | . | . |

As you can see, each record has two features, called X and Y. Our data set contains 20,000 samples. Now we are going to plot the data using the Python matplotlib library:

```python
# Python 3 code for plotting the dataset
from matplotlib import pyplot as plt
import numpy as np

my_data = np.genfromtxt('data.txt')
plt.plot(my_data[:, 0], my_data[:, 1], 'b*')
plt.show()
```

And the result is:

As you can see in the image, the natural way to cluster this data is into five clusters. Here is the Python code for k-means clustering.

```python
'''
Note that this is an unsupervised algorithm.
It uses the Forgy method for initialization, since that is the preferred
method for standard k-means: it chooses k random samples as the initial
centers and then iteratively refines them.
There is no guarantee that k-means converges to the global optimum, and the
result depends heavily on initialization, so you should run it several times
and choose the best result.
Stop conditions:
1. number of iterations >= max_number_of_iteration
2. no change in cluster assignments
Input parameters:
samples : the dataset you want to cluster (numpy array)
k : number of clusters
centers : user-provided centers (default is None); must be a list of lists,
          for example:
          centers = [
              [feat1, feat2, feat3],
              [feat1, feat2, feat3],
              [feat1, feat2, feat3],
              [feat1, feat2, feat3],
          ]
max_number_of_iteration : maximum number of iterations (default is 1000).
'''
import numpy as np


class kmeans():
    def __init__(self, samples, k=2, centers=None, max_number_of_iteration=1000):
        self.max_number_of_iteration = max_number_of_iteration
        self.number_of_samples = len(samples)
        self.cluster_number = np.ones((self.number_of_samples,)) * -1
        self.k = k
        self.samples = samples
        if centers is None:
            # Forgy method: select k distinct random samples as centers
            self.centers = list()
            temp_list = list()
            while len(self.centers) != k:
                center_temp = np.random.randint(0, self.number_of_samples)
                if center_temp not in temp_list:
                    self.centers.append(self.samples[center_temp])
                    temp_list.append(center_temp)
        else:
            if isinstance(centers, list):
                if len(centers) == k:
                    self.centers = centers
                else:
                    raise ValueError("centers length must be equal to the number of clusters (k)")
            else:
                raise TypeError("centers must be a list")

    def compute_distance(self, instance1, instance2):
        # Euclidean distance.
        # You can replace this method with any metric you like, e.g. the
        # commented-out Manhattan distance below.
        if len(instance1) != len(instance2):
            raise ValueError("instance1 and instance2 must be the same size.")
        # return np.sum(np.abs(instance1 - instance2))  # Manhattan distance
        return np.sqrt(np.sum((instance1 - instance2) ** 2))

    def run(self):
        changed = np.ones((self.number_of_samples,))
        iteration = 0
        distance_array = np.zeros((self.number_of_samples, self.k))
        print('Starting iterations...')
        # You can change the stop condition here by removing one of these clauses.
        while iteration < self.max_number_of_iteration and np.any(changed):
            # assignment step: move each sample to its nearest center
            for i in range(self.number_of_samples):
                for j in range(self.k):
                    distance_array[i, j] = self.compute_distance(self.samples[i],
                                                                 self.centers[j])
                nearest = np.argmin(distance_array[i])
                if self.cluster_number[i] == nearest:
                    changed[i] = 0
                else:
                    changed[i] = 1
                    self.cluster_number[i] = nearest
            iteration += 1
            # update step: recompute each center as the mean of its members
            for k in range(self.k):
                members = self.samples[self.cluster_number == k]
                if len(members) != 0:
                    self.centers[k] = members.sum(axis=0) / len(members)
            print("iter #" + str(iteration))
        print('Done.')
        print("Number of samples in each cluster:")
        for i in range(self.k):
            num = len(self.samples[self.cluster_number == i])
            print("Cluster #" + str(i) + " has " + str(num) + " members.")
        return self.cluster_number


if __name__ == "__main__":
    from matplotlib import pyplot as plt
    data = np.genfromtxt("data.txt")
    k = 5
    # centers = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    kmeans_test = kmeans(data, k)
    kmeans_test.run()
    kmeans_test.centers = np.array(kmeans_test.centers)
    for i in range(k):
        class1 = kmeans_test.samples[kmeans_test.cluster_number == i]
        plt.plot(class1[:, 0], class1[:, 1], 'r.')
    plt.plot(kmeans_test.centers[:, 0], kmeans_test.centers[:, 1], 'b*', markersize=13)
    plt.show()
```
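As a side note, the double loop over samples and centers in `run()` can be replaced with NumPy broadcasting, which is considerably faster for 20,000 samples. Here is a sketch of the vectorized assignment and update steps (the function names `assign_clusters` and `update_centers` are our own, not part of the class above):

```python
import numpy as np

def assign_clusters(samples, centers):
    # samples: (n, d) array, centers: (k, d) array.
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d) pairwise differences.
    diff = samples[:, None, :] - centers[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))  # shape (n, k)
    return distances.argmin(axis=1)               # nearest center per sample

def update_centers(samples, labels, centers):
    # Mean of each cluster's members; keep the old center if a cluster is empty.
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = samples[labels == j]
        if len(members):
            new_centers[j] = members.mean(axis=0)
    return new_centers
```

Alternating these two functions until the labels stop changing gives the same result as the loop-based implementation, since both compute exact Euclidean distances and exact cluster means.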

You can download the code from here.