What is clustering? Explain k-means clustering algorithm.


written 7.8 years ago by

ashishravindrasalve • 870

modified 7.8 years ago by

ramnath • 100

Clustering is a data mining technique that places data elements into related groups without advance knowledge of the group definitions.
It partitions a data set into meaningful sub-classes, called clusters.
A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

For example, when the points in a scatter plot form four well-separated groups, we can easily identify the 4 clusters into which the data can be divided.

k-means algorithm:

K-means clustering is an algorithm that groups objects into K groups based on their attributes or features.
K is a positive integer chosen by the user.
First, define K centroids, one for each cluster, generally placed far away from each other.
Then assign each element to the cluster whose centroid is nearest to it.
After this first step, recalculate the centroid of each cluster from the elements now in that cluster.
Repeat the same method, regrouping the elements around the new centroids.
In each step the centroids may change and elements may move from one cluster to another.
Continue until no element moves from one cluster to another.
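
One round of the two steps described above (grouping elements around the nearest centroid, then recalculating each centroid) can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `assign_to_nearest` and `recompute_centroids`, the use of Euclidean distance, and the rule that an empty cluster keeps its old centroid are all assumptions made for this example.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    labels = []
    for p in points:
        # Euclidean distance is assumed here as the similarity measure.
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))  # ties go to the lower index
    return labels

def recompute_centroids(points, labels, centroids):
    """Return the mean point of each cluster."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        # Average the members coordinate by coordinate.
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids

# Hypothetical sample data: two obvious groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.5)]
labels = assign_to_nearest(points, centroids)
centroids = recompute_centroids(points, labels, centroids)
print(labels, centroids)
```

Repeating these two calls until `labels` stops changing gives the full procedure.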

Algorithm:

k: number of clusters

n: number of sample feature vectors $x_1, x_2, \dots, x_n$

$m_i$: the mean of the vectors in cluster i
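
Written out, the mean of cluster i is the sum of its vectors divided by how many there are. The symbol $C_i$, the set of samples currently assigned to cluster i, is introduced here for illustration and does not appear in the original notation:

$$m_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$$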

Assume k < n.
- Make initial guesses for the means $m_1, m_2, \dots, m_k$
- Until there is no change in any mean
  - Use the estimated means to classify the samples into clusters.
  - For i from 1 to k
    - Replace $m_i$ with the mean of all of the samples for cluster i
  - End for
- End until
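
The pseudocode above translates fairly directly into Python. The sketch below is one possible reading of it, under stated assumptions: samples are equal-length numeric tuples, closeness is measured with Euclidean distance, an empty cluster keeps its previous mean, and `max_iterations` is a safety cap that is not part of the pseudocode. The function name `kmeans` and the sample data are hypothetical.

```python
import math

def kmeans(samples, initial_means, max_iterations=100):
    """Reassign samples and update means until no mean changes."""
    means = [tuple(m) for m in initial_means]
    clusters = []
    for _ in range(max_iterations):
        # Use the estimated means to classify the samples into clusters.
        clusters = [[] for _ in means]
        for x in samples:
            nearest = min(range(len(means)), key=lambda i: math.dist(x, means[i]))
            clusters[nearest].append(x)
        # Replace m_i with the mean of all of the samples for cluster i.
        new_means = []
        for i, members in enumerate(clusters):
            if members:
                new_means.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_means.append(means[i])  # assumption: empty cluster keeps its mean
        if new_means == means:  # no change in any mean, so stop
            break
        means = new_means
    return means, clusters

# Hypothetical 2-D samples with initial guesses taken from the data.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
means, clusters = kmeans(samples, initial_means=[(0.0, 0.0), (10.0, 10.0)])
print(means)
print(clusters)
```

The result depends on the initial guesses, so different starting means can produce different final clusters.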

Suppose the data to be clustered is 2, 4, 10, 12, 3, 20, 30, 11, 25 and k = 2.
1. Randomly choose the initial means $m_1$=3 and $m_2$=4.
2. The numbers closer to $m_1$=3 are grouped into cluster $k_1$, and the numbers closer to $m_2$=4 are grouped into cluster $k_2$.
3. Calculate the new mean of each cluster, then regroup the numbers around the new means.
4. $k_1$={2,3}, $k_2$={4,10,12,20,30,11,25}, $m_1$=2.5, $m_2$=16
5. $k_1$={2,3,4}, $k_2$={10,12,20,30,11,25}, $m_1$=3, $m_2$=18
6. $k_1$={2,3,4,10}, $k_2$={12,20,30,11,25}, $m_1$=4.75, $m_2$=19.6
7. $k_1$={2,3,4,10,11,12}, $k_2$={20,30,25}, $m_1$=7, $m_2$=25
8. $k_1$={2,3,4,10,11,12}, $k_2$={20,30,25}
9. Stop, because the clusters in steps 7 and 8 are the same.
10. So the final answer is $k_1$={2,3,4,10,11,12}, $k_2$={20,30,25}.
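
The worked example can be checked with a short script. This is a sketch for one-dimensional data only. The function name `kmeans_1d`, the `max_passes` cap, and the rule that an empty cluster keeps its mean are assumptions added for illustration. The data list includes 30, since that value appears in the clusters of every step.

```python
def kmeans_1d(data, means, max_passes=100):
    """1-D k-means that records the clusters and means after every pass."""
    history = []
    for _ in range(max_passes):
        # Group each number with the nearest current mean.
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Recompute each mean; an empty cluster keeps its old mean (assumption).
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        history.append((clusters, new_means))
        if new_means == means:  # means unchanged, so the clusters are final
            break
        means = new_means
    return history

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
for step, (clusters, means) in enumerate(kmeans_1d(data, [3, 4]), start=1):
    print(step, [sorted(c) for c in clusters], means)
```

The script takes five passes: the first four produce the clusters and means listed in steps 4 to 7, and the fifth repeats the same clusters, which is the stopping condition in steps 8 and 9.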
