- A cluster is a group of objects that belong to the same class.
- In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in other clusters.
- Clustering is the process of grouping a set of abstract objects into classes of similar objects.
- The main advantage of clustering over classification is that it is adaptable to changes and helps single out the useful features that distinguish different groups.
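To make the idea of "grouping similar objects" concrete, here is a minimal sketch of one common clustering method, k-means, written with only the Python standard library. The text above does not name a specific algorithm, so the choice of k-means, the function name `kmeans`, and the sample points are all assumptions made for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Assign each 2-D point a cluster label in 0..k-1 using plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points

    def nearest(p, centers):
        # Index of the centroid with the smallest squared Euclidean distance to p.
        return min(range(len(centers)),
                   key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)

    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids

    # Final assignment so the labels match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids

# Hypothetical sample data: two visibly separate groups of points.
sample = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
sample_labels, sample_centroids = kmeans(sample, k=2)
```

No class labels are supplied anywhere: the groups emerge from the distances between the points alone, which is what separates clustering from classification. Note that k-means needs the number of clusters `k` up front and can settle on a poor grouping for less well-separated data.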
Applications of Clustering:
- It is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
1. Marketing:
- It can help marketers discover distinct groups in their customer base and then characterize those groups based on purchasing patterns.
2. Land use:
- It helps in identification of areas of similar land use in an earth observation database.
3. Insurance:
- It helps identify groups of motor insurance policy holders with a high average claim cost.
4. City Planning:
- It helps in the identification of groups of houses in a city according to house type, value, and geographic location.
5. Earthquake Studies:
- Observed earthquake epicenters are expected to cluster along continental faults.
6. Astronomy:
- It helps to find groups of similar stars and galaxies.
7. Biology:
- It can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
Requirements of Clustering in Data Mining
1. Scalability - We need highly scalable clustering algorithms to deal with large databases.
2. Ability to deal with different kinds of attributes - Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.
3. Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find small spherical clusters.
4. High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
5. Ability to deal with noisy data - Databases often contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
6. Interpretability - The clustering results should be interpretable, comprehensible and usable.
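Requirements 3 and 5 above can be illustrated with a density-based method. The sketch below is a minimal, unoptimized version of DBSCAN using only the Python standard library. The list above does not prescribe any algorithm, so DBSCAN is an assumed example, and the names `dbscan`, `eps`, `min_pts`, and the sample data are illustrative choices.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is also a core point: keep expanding
        cluster_id += 1
    return labels

# Hypothetical data: a long thin chain, a compact blob, and one outlier.
chain = [(float(x), 0.0) for x in range(10)]
blob = [(0.0, 5.0), (0.5, 5.0), (0.0, 5.5), (0.5, 5.5)]
outlier = [(20.0, 20.0)]
demo_labels = dbscan(chain + blob + outlier, eps=1.1, min_pts=3)
```

With these assumed parameters, the elongated chain is kept together as one cluster even though its two ends are far apart, the blob forms a second cluster, and the isolated point is marked as noise instead of being forced into a cluster. The neighbor search here scans every point, so this version does not meet the scalability requirement; practical implementations typically rely on a spatial index.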