Question: What is clustering? Explain k-means clustering algorithm.

Mumbai University > Information Technology > Sem6 > Data Mining and Business Intelligence

Marks: 10M

Year: Dec 2015

modified 3.2 years ago by Ramnath, written 3.2 years ago by ASHISH RAVINDRA SALVE
  • Clustering is a data mining technique that places data elements into related groups without advance knowledge of the group definitions.
  • Clustering partitions a set of data into meaningful sub-classes, called clusters.
  • A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

[Figure: data points forming 4 clusters]

  • In this case, the 4 clusters into which the data can be divided are easy to identify.

k-means algorithm:

  • K-means clustering is an algorithm that groups objects into K groups based on their attributes or features.
  • K is a positive integer, chosen by the user.
  • Define K centroids, one for each cluster, generally placed far away from each other.
  • Then assign each element to the cluster whose centroid is nearest to it.
  • After this first step, recalculate the centroid of each cluster from the elements now in that cluster.
  • Repeat the same method, regrouping the elements around the new centroids.
  • At every step, the centroids may change and elements may move from one cluster to another.
  • Repeat the process until no element moves from one cluster to another.
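
The two repeated steps described above can be written compactly. This is a sketch under assumptions not stated in the answer: "nearest" is taken to mean smallest Euclidean distance, and $C_i$ is notation introduced here for the set of elements currently assigned to cluster i, with $m_i$ its centroid.

$$c(x_j) = \arg\min_{1 \le i \le k} \lVert x_j - m_i \rVert, \qquad m_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$$

The first expression assigns element $x_j$ to the cluster with the nearest centroid; the second recomputes each centroid as the mean of its cluster's elements.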

    Algorithm:

    k: number of clusters

    n: number of sample feature vectors $x_1, x_2, \ldots, x_n$

    $m_i$: the mean of the vectors in cluster i

    Assume k < n

    • Make initial guesses for the means $m_1, m_2, \ldots, m_k$

    • Until there are no changes in any mean

      • Use the estimated means to classify the samples into clusters

      • For i from 1 to k

        • Replace $m_i$ with the mean of all of the samples in cluster i

      • End for
    • End until
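
The pseudocode above could be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `k_means`, the `max_iter` safety cap, the use of Euclidean distance via `math.dist`, and the choice to keep the old mean when a cluster ends up empty are all assumptions added here.

```python
import math


def k_means(samples, initial_means, max_iter=100):
    """Cluster `samples` (tuples of numbers) around the given initial means.

    Returns (means, clusters). `max_iter` is an assumed safety cap that the
    pseudocode does not have.
    """
    means = [tuple(m) for m in initial_means]
    clusters = [[] for _ in means]
    for _ in range(max_iter):
        # Classify each sample into the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for x in samples:
            nearest = min(range(len(means)), key=lambda i: math.dist(x, means[i]))
            clusters[nearest].append(tuple(x))
        # Replace each m_i with the mean of the samples in cluster i.
        # Assumption: an empty cluster keeps its previous mean.
        new_means = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for members, old in zip(clusters, means)
        ]
        if new_means == means:  # no change in any mean: stop
            break
        means = new_means
    return means, clusters


# Hypothetical 2-D data, chosen only for illustration.
means, clusters = k_means([(1, 1), (1, 2), (9, 9), (10, 9)], [(0, 0), (10, 10)])
print(means)     # [(1.0, 1.5), (9.5, 9.0)]
print(clusters)  # [[(1, 1), (1, 2)], [(9, 9), (10, 9)]]
```

The result depends on the initial guesses; different starting means can converge to different clusters.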

    • Suppose the data to be clustered is 2, 4, 10, 12, 3, 20, 30, 11, 25, with k = 2.

    1. Randomly choose the initial means $m_1$=3 and $m_2$=4.
    2. The numbers closer to mean $m_1$=3 are grouped into cluster $k_1$, and the numbers closer to mean $m_2$=4 are grouped into cluster $k_2$.
    3. Calculate the new mean of each cluster, then regroup the numbers around the new means. Repeating this gives the following steps.
    4. $k_1$={2,3}, $k_2$={4,10,12,20,30,11,25}, $m_1$=2.5, $m_2$=16
    5. $k_1$={2,3,4}, $k_2$={10,12,20,30,11,25}, $m_1$=3, $m_2$=18
    6. $k_1$={2,3,4,10}, $k_2$={12,20,30,11,25}, $m_1$=4.75, $m_2$=19.6
    7. $k_1$={2,3,4,10,11,12}, $k_2$={20,30,25}, $m_1$=7, $m_2$=25
    8. $k_1$={2,3,4,10,11,12}, $k_2$={20,30,25}, $m_1$=7, $m_2$=25
    9. Stop, because the clusters and means in steps 7 and 8 are the same.
    10. So the final answer is $k_1$={2,3,4,10,11,12}, $k_2$={20,30,25}.
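
The arithmetic in the steps above can be checked with a short script. This is only a sketch: the helper name `trace_k_means_1d` is made up for this example, ties are sent to the first cluster (an assumption; no tie occurs with this data), and the data set is assumed to include 30, as the clusters listed in the steps do.

```python
def trace_k_means_1d(data, m1, m2):
    """Run two-cluster k-means on numbers and record every pass.

    Returns a list of (k1, k2, m1, m2) tuples, one per pass, where m1 and m2
    are the means recomputed after that pass. Assumes neither cluster ever
    becomes empty.
    """
    history = []
    while True:
        # Group each number with the closer mean (ties go to k1 by assumption).
        k1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
        k2 = [x for x in data if abs(x - m1) > abs(x - m2)]
        new_m1, new_m2 = sum(k1) / len(k1), sum(k2) / len(k2)
        history.append((k1, k2, new_m1, new_m2))
        if (new_m1, new_m2) == (m1, m2):  # means unchanged: converged
            return history
        m1, m2 = new_m1, new_m2


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
history = trace_k_means_1d(data, 3, 4)
for k1, k2, m1, m2 in history:
    print(sorted(k1), sorted(k2), m1, m2)
```

Each printed line corresponds to one of steps 4 to 8: the clusters formed in that pass, followed by their recomputed means.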
