0
66kviews
Explain BIRCH algorithm with example
1 Answer
9
3.3kviews

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

  • It is a scalable clustering method.
  • Designed for very large data sets
  • Only one scan of data is necessary
  • It is based on the notation of CF (Clustering Feature) a CF Tree.
  • CF tree is a height balanced tree that stores the clustering features for a hierarchical clustering.
  • Cluster of data points is represented by a triple of numbers (N,LS,SS) Where

    N= Number of items in the sub cluster

    LS=Linear sum of the points

    SS=sum of the squared of the points

A CF Tree structure is given as below:

  • Each non-leaf node has at most B entries.
  • Each leaf node has at most L CF entries which satisfy threshold T, a maximum diameter of radius
  • P(page size in bytes) is the maximum size of a node
  • Compact: each leaf node is a subcluster, not a data point

enter image description here

Basic Algorithm:

  • Phase 1: Load data into memory

    Scan DB and load data into memory by building a CF tree. If memory is exhausted rebuild the tree from the leaf node.

  • Phase 2: Condense data

    Resize the data set by building a smaller CF tree

    Remove more outliers

    Condensing is optional

  • Phase 3: Global clustering

    Use existing clustering algorithm (e.g. KMEANS, HC) on CF entries

  • Phase 4: Cluster refining

    Refining is optional

    Fixes the problem with CF trees where same valued data points may be assigned to different leaf entries.

enter image description here

Example:

Clustering feature:

CF= (N, LS, SS)

N: number of data points

LS: $ ∑_{i=1}^N= X_i $

$$SS:∑_{i=1}^N= X_I^2 $$

enter image description here

(3,4) (2,6)(4,5)(4,7)(3,8)

N=5

NS= (16, 30 ) i.e. 3+2+4+4+3=16 and 4+6+5+7+8=30

$SS=(54,190)=3^2+2^2+4^2+4^2+3^2 =54 \ \ and \ \ 4^2+6^2+5^2+7^2+8^2 =190$

  • Advantages: Finds a good clustering with a single scan and improves the quality with a few additional scans
  • Disadvantages: Handles only numeric data
  • Applications:

    Pixel classification in images

    Image compression

    Works with very large data sets

Please log in to add an answer.