BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

- It is a scalable clustering method.
- Designed for very large data sets
- Only one scan of data is necessary
- It is based on the notation of CF (Clustering Feature) a CF Tree.
- CF tree is a height balanced tree that stores the clustering features for a hierarchical clustering.
Cluster of data points is represented by a triple of numbers (N,LS,SS) Where

N= Number of items in the sub cluster

LS=Linear sum of the points

SS=sum of the squared of the points

**A CF Tree structure is given as below:**

- Each non-leaf node has at most B entries.
- Each leaf node has at most L CF entries which satisfy threshold T, a maximum diameter of radius
- P(page size in bytes) is the maximum size of a node
- Compact: each leaf node is a subcluster, not a data point

**Basic Algorithm:**

Phase 1: Load data into memory

Scan DB and load data into memory by building a CF tree. If memory is exhausted rebuild the tree from the leaf node.

Phase 2: Condense data

Resize the data set by building a smaller CF tree

Remove more outliers

Condensing is optional

Phase 3: Global clustering

Use existing clustering algorithm (e.g. KMEANS, HC) on CF entries

Phase 4: Cluster refining

Refining is optional

Fixes the problem with CF trees where same valued data points may be assigned to different leaf entries.

**Example:**

**Clustering feature:**

CF= (N, LS, SS)

N: number of data points

LS: $ ∑_{i=1}^N= X_i $

$$SS:∑_{i=1}^N= X_I^2 $$

(3,4) (2,6)(4,5)(4,7)(3,8)

N=5

NS= (16, 30 ) i.e. 3+2+4+4+3=16 and 4+6+5+7+8=30

$SS=(54,190)=3^2+2^2+4^2+4^2+3^2 =54 \ \ and \ \ 4^2+6^2+5^2+7^2+8^2 =190$

- Advantages: Finds a good clustering with a single scan and improves the quality with a few additional scans
- Disadvantages: Handles only numeric data
Applications:

Pixel classification in images

Image compression

Works with very large data sets