Explain data discretization and concept hierarchy generation

## Data Discretization

• Data discretization divides the range of a continuous attribute into intervals.
• Interval labels can then be used to replace actual data values.
• This reduces the number of values for a given continuous attribute.
• Some classification algorithms only accept categorical attributes.
• This leads to a concise, easy-to-use, knowledge-level representation of mining results.
• Discretization techniques can be categorized by whether or not they use class information:
• Supervised Discretization - This discretization process uses class information.
• Unsupervised Discretization - This discretization process does not use class information.
• Discretization techniques can also be categorized by the direction in which they proceed:

Top-down Discretization -

• Starts by finding one or a few points, called split points or cut points, to split the entire attribute range, and then repeats this recursively on the resulting intervals.

Bottom-up Discretization -

• Starts by considering all of the continuous values as potential split-points.
• Removes some of them by merging neighboring values to form intervals, and then applies this process recursively to the resulting intervals.

## Concept Hierarchies

• Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a Concept Hierarchy.
• Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
• In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
• This organization provides users with the flexibility to view data from different perspectives.
• Data mining on a reduced data set means fewer input and output operations and is more efficient than mining on a larger data set.
• Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.
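
As an illustration of replacing low-level concepts with higher-level ones, here is a minimal Python sketch that maps raw ages to coarser labels. The cut points (30 and 60) and the label names are assumptions chosen only for this example; they are not prescribed by any method described here.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut points and labels, chosen only for illustration.
CUTS = [30, 60]
LABELS = ["young", "middle-aged", "senior"]

def generalize(age):
    """Map a raw age (low-level concept) to its higher-level concept."""
    # bisect_right returns how many cut points are <= age,
    # which is exactly the index of the interval the age falls into.
    return LABELS[bisect_right(CUTS, age)]

ages = [19, 23, 30, 35, 42, 58, 61, 70]
concepts = [generalize(a) for a in ages]

# Eight distinct values collapse into three concepts.
summary = Counter(concepts)
```

After the mapping, mining works with three labels instead of eight distinct numbers, which is the kind of reduction described above.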

## Typical Methods of Discretization and Concept Hierarchy Generation for Numerical Data

1] Binning

• Binning is a top-down splitting technique based on a specified number of bins.
• Binning is an unsupervised discretization technique because it does not use class information.
• The sorted values are distributed into several buckets or bins, and each value is then replaced by its bin mean or median.
• Binning is further classified into:
• Equal-width (distance) partitioning
• Equal-depth (frequency) partitioning
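
The two partitioning schemes and smoothing by bin means can be sketched in plain Python as follows. The function names and the sample `prices` list are illustrative assumptions, and the sketch uses only the standard library.

```python
from statistics import mean

def equal_width_bins(values, k):
    """Split values into k bins that each cover an equal share of [min, max]."""
    data = sorted(values)
    lo, hi = data[0], data[-1]
    if hi == lo:
        return [data]  # all values identical: a single bin is all we can form
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in data:
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split values into k bins holding (nearly) the same number of values."""
    data = sorted(values)
    n = len(data)
    # Boundaries at i*n/k keep bin sizes within one of each other.
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]

def smooth_by_bin_means(bins):
    """Replace every value by the mean of its bin (empty bins are skipped)."""
    return [[mean(b)] * len(b) for b in bins if b]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_bins(prices, 3)   # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
depth_bins = equal_depth_bins(prices, 3)   # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
smoothed = smooth_by_bin_means(depth_bins) # bin means are 9, 22 and 29
```

Equal-width bins can end up with very different counts when the data are skewed, whereas equal-depth bins keep the counts balanced at the cost of uneven interval widths.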

2] Histogram Analysis

• It is an unsupervised discretization technique because histogram analysis does not use class information.
• Histograms partition the values for an attribute into disjoint ranges called buckets.
• Histograms are further classified into:
• Equal-width histogram
• Equal-frequency histogram
• The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached.
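
The recursive use of histograms can be sketched as below, assuming equal-width buckets and a fixed number of parts per split. The function name, the nested-dict layout, and the sample `ages` are assumptions made for illustration only.

```python
def equal_width_hierarchy(values, lo, hi, parts=2, levels=2):
    """Recursively split [lo, hi) into equal-width buckets.

    Each node stores its range, the number of values inside it, and its
    child buckets. Recursion stops after `levels` levels below the root,
    mirroring the pre-specified number of concept levels.
    """
    node = {
        "range": (lo, hi),
        "count": sum(lo <= v < hi for v in values),
        "children": [],
    }
    if levels == 0:
        return node
    width = (hi - lo) / parts
    for i in range(parts):
        child_lo = lo + i * width
        child_hi = lo + (i + 1) * width
        node["children"].append(
            equal_width_hierarchy(values, child_lo, child_hi, parts, levels - 1)
        )
    return node

ages = [3, 7, 12, 18, 25, 31, 38, 44, 52, 59, 63, 77]
tree = equal_width_hierarchy(ages, 0, 80, parts=2, levels=2)

# Level 1 splits [0, 80) into [0, 40) and [40, 80);
# level 2 splits each of those into two buckets of width 20.
level1_counts = [c["count"] for c in tree["children"]]
level2_counts = [g["count"] for c in tree["children"] for g in c["children"]]
```

Each level of the returned tree is one level of the concept hierarchy, with the finest buckets at the leaves.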

3] Cluster Analysis

• Cluster analysis is a popular data discretization method.
• A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
• Clustering considers the distribution of A, as well as the closeness of data points, and therefore can produce high-quality discretization results.
• Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
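
One way to try clustering-based discretization is a one-dimensional k-means, sketched below. The choice of k-means, the deterministic starting centers, and the use of midpoints between neighboring clusters as interval boundaries are all assumptions of this sketch; other clustering algorithms could be substituted.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain 1-D k-means (Lloyd's algorithm) with deterministic starting centers."""
    data = sorted(values)
    # Assumption: start from k values spread evenly through the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return clusters

def cut_points(clusters):
    """Interval boundaries: midpoints between neighboring non-empty clusters."""
    groups = [c for c in clusters if c]
    return [(a[-1] + b[0]) / 2 for a, b in zip(groups, groups[1:])]

salaries = [21, 22, 24, 25, 48, 50, 52, 53, 90, 94, 95]
clusters = kmeans_1d(salaries, 3)
boundaries = cut_points(clusters)
```

Running the same function again on the values of a single cluster would produce subclusters, giving the next lower level of the hierarchy.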

4] Entropy-Based Discretization

• Entropy-based discretization is a supervised, top-down splitting technique.
• It explores class distribution information in its calculation and determination of split points.
• Let D consist of data instances defined by a set of attributes and a class-label attribute.
• The class-label attribute provides the class information per instance.
• Because the split points are chosen using class information, the resulting interval boundaries may help to improve classification accuracy.
• The entropy and information gain measures it relies on are also used for decision tree induction.
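
A single splitting step can be sketched as follows: every midpoint between consecutive distinct values is tried as a candidate, and the one with the highest information gain is kept. The function names and the toy `ages`/`buys` data are illustrative assumptions. A full top-down procedure would repeat this step on each resulting interval until some stopping criterion is met, which is not shown here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, information_gain) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([lab for _, lab in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Expected entropy after the split, weighted by partition size.
        expected = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        gain = base - expected
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best

ages = [22, 25, 28, 35, 41, 46, 52, 60]
buys = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
split, gain = best_split(ages, buys)
```

In this toy data the classes separate cleanly between 35 and 41, so the chosen split point is their midpoint and both resulting intervals are pure.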

5] Interval Merge by χ² Analysis

• ChiMerge is a χ²-based, bottom-up discretization method.
• It repeatedly finds the best neighboring intervals and merges them to form larger intervals.
• The method is supervised in that it uses class information.
• ChiMerge treats intervals as discrete categories.
• The basic notion is that for accurate discretization, the relative class frequencies should be fairly consistent within an interval.
• Therefore, if two adjacent intervals have a very similar distribution of classes, then the intervals can be merged.
• Otherwise, they should remain separate.
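
The merging loop can be sketched in Python as below. The sketch starts with one interval per distinct value, computes the Pearson χ² statistic for each pair of adjacent intervals, and merges the pair with the lowest value until that value reaches a threshold. The function names, the toy data, and the threshold of 2.706 (the χ² critical value for one degree of freedom at the 0.10 significance level) are assumptions made for illustration.

```python
from collections import Counter

def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    classes = set(counts_a) | set(counts_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row.values())
        for cls in classes:
            col_total = counts_a.get(cls, 0) + counts_b.get(cls, 0)
            expected = row_total * col_total / total
            if expected > 0:
                chi2 += (row.get(cls, 0) - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, threshold):
    """Merge adjacent intervals while the lowest chi-square is below threshold."""
    # Bottom-up start: one interval per distinct value, with its class counts.
    grouped = {}
    for v, lab in zip(values, labels):
        grouped.setdefault(v, Counter())[lab] += 1
    intervals = [([v, v], grouped[v]) for v in sorted(grouped)]
    while len(intervals) > 1:
        scores = [chi_square(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] >= threshold:
            break  # even the most similar neighbors differ too much to merge
        (lo, _), counts_lo = intervals[i]
        (_, hi), counts_hi = intervals[i + 1]
        intervals[i:i + 2] = [([lo, hi], counts_lo + counts_hi)]
    return [tuple(bounds) for bounds, _ in intervals]

values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
merged = chimerge(values, labels, threshold=2.706)
```

A low χ² value means the two intervals have similar class distributions, so merging them loses little class information; a high value means the class is not independent of the interval, so the boundary is kept.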