Explain data discretization and concept hierarchy generation

written 2.3 years ago by

binitamayekar ★ 6.5k

• modified 2.3 years ago

Data Discretization

Data discretization divides the range of a continuous attribute into intervals.
Interval labels can then replace the actual data values.
This reduces the number of distinct values for a given continuous attribute.
It is also needed because some classification algorithms accept only categorical attributes.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Discretization techniques can be categorized by whether they use class information:
- Supervised Discretization - This discretization process uses class information.
- Unsupervised Discretization - This discretization process does not use class information.
Discretization techniques can also be categorized by the direction in which they proceed:

Top-down Discretization -

The process starts by finding one or a few points, called split points or cut points, that split the entire attribute range, and then repeats this recursively on the resulting intervals.

Bottom-up Discretization -

The process starts by considering all of the continuous values as potential split points.
It removes some of them by merging neighboring values to form intervals, and then recursively applies this process to the resulting intervals.

Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a Concept Hierarchy.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
This organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input and output operations and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.
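
As an illustration of replacing low-level concepts with higher-level ones, here is a minimal Python sketch. The attribute (age), the 10-year intervals, and the group names "youth", "adult", and "senior" are assumptions chosen for the example, not part of any standard hierarchy.

```python
# Hypothetical two-level concept hierarchy for an "age" attribute:
# raw value -> 10-year interval -> coarse age group.

def age_to_interval(age):
    """Level 1: map a raw age to a 10-year interval label."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def interval_to_group(interval):
    """Level 2: map an interval label to a coarser (assumed) age group."""
    low = int(interval.split("-")[0])
    if low < 20:
        return "youth"
    if low < 60:
        return "adult"
    return "senior"

ages = [13, 15, 22, 25, 33, 35, 45, 61, 70]          # illustrative data
intervals = [age_to_interval(a) for a in ages]
groups = [interval_to_group(i) for i in intervals]

# Each level up the hierarchy leaves fewer distinct values to mine.
print(len(set(ages)), len(set(intervals)), len(set(groups)))
```

Moving up one level at a time shrinks the number of distinct values, which is the data reduction described above.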

1] Binning

Binning is a top-down splitting technique based on a specified number of bins.
Binning is an unsupervised discretization technique because it does not use class information.
The sorted values are distributed into several buckets or bins, and each value is then replaced by the mean or median of its bin.
Binning is further classified into
- Equal-width (distance) partitioning
- Equal-depth (frequency) partitioning
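
The two partitioning schemes and smoothing by bin means can be sketched as follows. This is a minimal sketch, not a reference implementation: the function names and the sample prices are made up for illustration, and the equal-width version assumes the data contain at least two distinct values.

```python
def equal_width_bins(values, k):
    """Split values into k bins that each cover an equal share of [min, max]."""
    data = sorted(values)
    lo, hi = data[0], data[-1]
    width = (hi - lo) / k              # assumes hi > lo
    bins = [[] for _ in range(k)]
    for v in data:
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split values into k bins holding (roughly) the same number of values."""
    data = sorted(values)
    n = len(data)
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]

def smooth_by_mean(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) if b else [] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative data

print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
print(smooth_by_mean(equal_depth_bins(prices, 3)))
```

With this data, equal-width binning produces bins of different sizes, while equal-depth binning puts three values in each bin.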

2] Histogram Analysis

It is an unsupervised discretization technique because histogram analysis does not use class information.
Histograms partition the values for an attribute into disjoint ranges called buckets.
Histograms are further classified into
- Equal-width histogram
- Equal-frequency histogram
The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached.
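
The recursive procedure can be sketched with an equal-width histogram. This is only a sketch under stated assumptions: the function name, the `fanout` and `levels` parameters, the dictionary layout, and the sample data are all hypothetical.

```python
def histogram_hierarchy(values, lo, hi, fanout, levels):
    """Recursively split [lo, hi] into `fanout` equal-width buckets.

    `levels` is the pre-specified number of concept levels still to build;
    recursion stops when it reaches zero or a bucket is empty.
    """
    node = {"range": (lo, hi), "count": len(values), "children": []}
    if levels == 0 or not values:
        return node
    width = (hi - lo) / fanout
    for i in range(fanout):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        last = i == fanout - 1
        # Buckets are half-open [b_lo, b_hi); the last one also includes hi.
        bucket = [v for v in values if b_lo <= v < b_hi or (last and v == hi)]
        node["children"].append(
            histogram_hierarchy(bucket, b_lo, b_hi, fanout, levels - 1)
        )
    return node

data = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 14, 15, 18, 20, 20]  # illustrative
root = histogram_hierarchy(data, 0, 20, fanout=2, levels=2)

for child in root["children"]:
    print(child["range"], child["count"],
          [(g["range"], g["count"]) for g in child["children"]])
```

Each level of the resulting tree is one level of the concept hierarchy, with the root covering the whole attribute range.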

3] Cluster Analysis

Cluster analysis is a popular data discretization method.
A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
Clustering considers the distribution of A, as well as the closeness of data points, and therefore can produce high-quality discretization results.
Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
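
As one possible illustration, the sketch below uses a simple one-dimensional k-means to group the values of an attribute and then derives cut points between neighboring clusters. The choice of k-means, the evenly spaced seeding, the function names, and the data are all assumptions; any clustering algorithm could take its place.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster one-dimensional values with a basic k-means loop."""
    data = sorted(values)
    # Assumption: seed the centroids at evenly spaced positions in sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:       # converged
            break
        centroids = updated
    return clusters

def cut_points(clusters):
    """Place a cut point midway between each pair of neighboring clusters."""
    non_empty = [c for c in clusters if c]
    return [(a[-1] + b[0]) / 2 for a, b in zip(non_empty, non_empty[1:])]

values = [1, 2, 3, 10, 11, 12, 30, 31, 32]    # illustrative data
clusters = kmeans_1d(values, k=3)
print(clusters)
print(cut_points(clusters))
```

Running the same function again on a single cluster would give its subclusters, that is, the next lower level of the hierarchy.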

4] Entropy-Based Discretization

Entropy-based discretization is a supervised, top-down splitting technique.
It explores class distribution information in its calculation and determination of split points.
Let D consist of data instances defined by a set of attributes and a class-label attribute.
The class-label attribute provides the class information per instance.
Because the interval boundaries or split points are chosen using class information, they may help to improve classification accuracy.
The entropy and information gain measures used here are the same ones used in decision tree induction.
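
A single split step can be sketched as follows. The sketch assumes the usual definitions: the entropy of a set is -Σ p_i log2(p_i) over its class proportions p_i, and the information gain of a split is the entropy before the split minus the size-weighted entropy of the two resulting intervals. The function names and the data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, information_gain) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        expected = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = base - expected
        if gain > best[1]:
            # Candidate split point: midpoint of the two neighboring values.
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best

values = [1, 2, 3, 4, 10, 11, 12, 13]                # illustrative data
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
split, gain = best_split(values, labels)
print(split, gain)
```

A top-down discretizer would apply `best_split` again to the values on each side of the chosen split point until some stopping criterion is met.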

5] Interval Merge by χ² Analysis

ChiMerge is a χ²-based, bottom-up discretization method.
It repeatedly finds the best neighboring intervals (the adjacent pair with the lowest χ² value) and merges them to form larger intervals.
The method is supervised in that it uses class information.
It treats intervals as discrete categories.
The basic notion is that for accurate discretization, the relative class frequencies should be fairly consistent within an interval.
Therefore, if two adjacent intervals have a very similar distribution of classes, then the intervals can be merged.
Otherwise, they should remain separate.
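
The merging loop can be sketched as below. This is a simplified sketch, not the full published algorithm: it starts with one interval per distinct value, merges the adjacent pair with the lowest χ² value, and stops once every pair reaches a threshold. The threshold 2.706 (the χ² critical value for 1 degree of freedom at the 0.10 level), the skipping of cells with zero expected count, the function names, and the data are assumptions made for the example.

```python
from collections import Counter

def chi2(a, b, classes):
    """Pearson chi-square statistic for the class counts of two intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:              # simplification: skip empty cells
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold):
    """Return the lower bounds of the intervals left after merging."""
    classes = sorted(set(labels))
    grouped = {}
    for v, label in zip(values, labels):
        grouped.setdefault(v, Counter())[label] += 1
    # One interval per distinct value: (lower bound, class counts).
    intervals = [(v, grouped[v]) for v in sorted(grouped)]
    while len(intervals) > 1:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] >= threshold:
            break                         # all neighbors differ enough
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lo for lo, _ in intervals]

values = [1, 2, 3, 4, 10, 11, 12, 13]                # illustrative data
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(chimerge(values, labels, threshold=2.706))
```

Neighboring values with the same class have a χ² value of 0 and are merged first, while the boundary between the two classes scores above the threshold and is kept.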
