Question: What is clustering? Explain k-means clustering algorithm.

Mumbai University > Information Technology > Sem6 > Data Mining and Business Intelligence

Marks: 10M

Year: Dec 2015

 modified 3.0 years ago by Ramnath • 3.7k written 3.0 years ago by
• Clustering is a data mining technique that places data elements into related groups without advance knowledge of the group definitions.
• Clustering partitions a set of data into meaningful sub-classes, called clusters.
• A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

• In the accompanying figure (not reproduced here), we can easily identify the 4 clusters into which the data can be divided.

k-means algorithm:

• K-means clustering is an algorithm that groups objects, based on their attributes or features, into K groups.
• K is a positive integer (chosen by the user).
• Define K initial centroids, one for each cluster, generally far away from each other.
• Then assign each element to the cluster whose centroid is nearest to it.
• After this first step, recalculate the centroid of each cluster from the elements now in that cluster.
• Repeat the same method, regrouping the elements around the new centroids.
• In every step, the centroids change and elements may move from one cluster to another.
• Repeat the process until no element moves from one cluster to another.
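
The two repeated steps can also be written compactly. The notation here is an assumption made for illustration: $C_i$ denotes the set of elements in cluster $i$, $m_i$ its centroid, and $\lVert \cdot \rVert$ the Euclidean distance (the distance measure most commonly used with k-means).

Assignment step (each element joins the cluster with the nearest centroid):

$$C_i = \{\, x_j : \lVert x_j - m_i \rVert \le \lVert x_j - m_l \rVert \ \text{for all } l = 1, \ldots, K \,\}$$

Update step (each centroid becomes the mean of its cluster):

$$m_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$$

An element that is equally close to two centroids needs a tie-breaking rule, for example assigning it to the cluster with the lower index.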

Algorithm:

k: number of clusters

n: number of sample feature vectors $x_1, x_2, \ldots, x_n$

$m_i$: the mean of the vectors in cluster i

Assume $k < n$

• Make initial guesses for the means $m_1, m_2, \ldots, m_k$

• Until there are no changes in any mean

• Use the estimated means to classify the samples into clusters

• For i from 1 to k

Replace $m_i$ with the mean of all of the samples for cluster i

• End for
• End until
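
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `k_means`, the `max_iter` safety cap, the use of Euclidean distance, and the rule of keeping the old mean when a cluster becomes empty are all assumptions that the pseudocode does not specify.

```python
import math

def k_means(samples, initial_means, max_iter=100):
    """Cluster `samples` (tuples of numbers) starting from `initial_means`."""
    means = [tuple(m) for m in initial_means]
    clusters = [[] for _ in means]
    for _ in range(max_iter):  # safety cap (assumption); normally stops earlier
        # Use the estimated means to classify the samples into clusters.
        clusters = [[] for _ in means]
        for x in samples:
            nearest = min(range(len(means)),
                          key=lambda i: math.dist(x, means[i]))
            clusters[nearest].append(x)
        # Replace m_i with the mean of all of the samples for cluster i.
        new_means = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_means.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_means.append(means[i])  # empty cluster: keep old mean (assumption)
        # Until there are no changes in any mean.
        if new_means == means:
            break
        means = new_means
    return means, clusters

# Hypothetical 2-D data, chosen only to show the call.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
means, clusters = k_means(points, initial_means=[(0.0, 0.0), (10.0, 10.0)])
print(means)
print(clusters)
```

Ties are broken in favour of the lower-numbered cluster because `min` returns the first minimal index. `math.dist` requires Python 3.8 or later.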

• Suppose the data for clustering is 2, 4, 10, 12, 3, 20, 30, 11, 25, with $k = 2$.

1. Randomly assign the initial means $m_1 = 3$ and $m_2 = 4$.
2. The numbers closer to $m_1 = 3$ are grouped into cluster $k_1$, and the numbers closer to $m_2 = 4$ are grouped into cluster $k_2$.
3. Recalculate the mean of each cluster, then regroup the numbers around the new means.
4. $k_1 = \{2, 3\}$, $k_2 = \{4, 10, 12, 20, 30, 11, 25\}$, $m_1 = 2.5$, $m_2 = 16$
5. $k_1 = \{2, 3, 4\}$, $k_2 = \{10, 12, 20, 30, 11, 25\}$, $m_1 = 3$, $m_2 = 18$
6. $k_1 = \{2, 3, 4, 10\}$, $k_2 = \{12, 20, 30, 11, 25\}$, $m_1 = 4.75$, $m_2 = 19.6$
7. $k_1 = \{2, 3, 4, 10, 11, 12\}$, $k_2 = \{20, 30, 25\}$, $m_1 = 7$, $m_2 = 25$
8. $k_1 = \{2, 3, 4, 10, 11, 12\}$, $k_2 = \{20, 30, 25\}$, $m_1 = 7$, $m_2 = 25$
9. Stop, since the clusters and means in steps 7 and 8 are the same.
10. So the final answer is $k_1 = \{2, 3, 4, 10, 11, 12\}$, $k_2 = \{20, 30, 25\}$.
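
The worked example can be checked with a short script. This is only a sketch: the function name `k_means_1d` and the `max_iter` cap are illustrative, and it assumes the data set includes the value 30, which appears in the clusters of the steps above. Each loop pass prints the clusters formed from the current means, followed by the recalculated means.

```python
def k_means_1d(data, initial_means, max_iter=100):
    """Trace 1-D k-means; returns the final clusters and means."""
    means = list(initial_means)
    clusters = [[] for _ in means]
    for _ in range(max_iter):  # safety cap (assumption)
        # Group each number with the nearest current mean.
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Recalculate each mean; keep the old one if a cluster is empty (assumption).
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        print(clusters, new_means)
        if new_means == means:  # no mean changed, so the clusters are stable
            break
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means_1d(data, initial_means=[3, 4])
```

Within each cluster the numbers appear in their original data order, so the lists may look different from the sets written above while containing the same members. The run ends with means 7 and 25, matching step 7.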