Explain data discretization and concept hierarchy generation

Data Discretization

  • Dividing the range of a continuous attribute into intervals.
  • Interval labels can then be used to replace actual data values.
  • Reduce the number of values for a given continuous attribute.
  • Some classification algorithms only accept categorical attributes.
  • This leads to a concise, easy-to-use, knowledge-level representation of mining results.
  • Discretization techniques can be categorized by whether they use class information:
    • Supervised Discretization - The discretization process uses class information.
    • Unsupervised Discretization - The discretization process does not use class information.
  • Discretization techniques can also be categorized by the direction in which they proceed:

Top-down Discretization -

  • Starts by finding one or a few points, called split points or cut points, to split the entire attribute range, and then repeats this recursively on the resulting intervals.

Bottom-up Discretization -

  • Starts by considering all of the continuous values as potential split-points.
  • Removes some of them by merging neighboring values to form intervals, and then recursively applies this process to the resulting intervals.

Concept Hierarchies

  • Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a Concept Hierarchy.
  • Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
  • In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
  • This organization provides users with the flexibility to view data from different perspectives.
  • Data mining on a reduced data set means fewer input and output operations and is more efficient than mining on a larger data set.
  • Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.
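
As a minimal sketch of replacing low-level concepts with higher-level ones, the snippet below rolls city values up to countries. The mapping `CITY_TO_COUNTRY`, the helper `roll_up`, and the sample records are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical two-level concept hierarchy for a "location" dimension: city -> country.
CITY_TO_COUNTRY = {
    "Mumbai": "India",
    "Pune": "India",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

def roll_up(values, hierarchy):
    """Replace each low-level value with its higher-level concept."""
    return [hierarchy.get(v, "unknown") for v in values]

sales_cities = ["Mumbai", "Pune", "Toronto", "Mumbai", "Vancouver"]
by_country = Counter(roll_up(sales_cities, CITY_TO_COUNTRY))
print(by_country)
```

Four distinct city values collapse to two country values, so the data set to be mined has fewer distinct values for this attribute.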

Typical Methods of Discretization and Concept Hierarchy Generation for Numerical Data

1] Binning

  • Binning is a top-down splitting technique based on a specified number of bins.
  • Binning is an unsupervised discretization technique because it does not use class information.
  • The sorted values are distributed into a number of buckets, or bins, and each value is then replaced by the mean or median of its bin.
  • Binning is further classified into:
    • Equal-width (distance) partitioning
    • Equal-depth (frequency) partitioning
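
The two partitioning schemes and smoothing by bin means can be sketched as follows. This is an illustrative sketch, not a reference implementation; the function names and the sample `prices` list are assumptions made for the example.

```python
def equal_width_bins(values, k):
    """Return, for each value, the index of the equal-width interval it falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (nearly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # assumed sample data
width_labels = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_means(depth_bins)
print(width_labels)
print(depth_bins)
print(smoothed)
```

With this data, equal-width binning uses intervals of width 10 starting at 4, while equal-depth binning places three values in each bin regardless of how far apart they are.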

2] Histogram Analysis

  • It is an unsupervised discretization technique because histogram analysis does not use class information.
  • Histograms partition the values for an attribute into disjoint ranges called buckets.
  • Histograms are also further classified into:
    • Equal-width histogram
    • Equal-frequency histogram
  • The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached.
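
The recursive procedure can be sketched with an equal-width histogram that is split again inside each bucket until a chosen number of levels is reached. The names `build_hierarchy`, `leaf_ranges`, the `fanout` parameter, and the sample data are assumptions for illustration only.

```python
def build_hierarchy(values, lo, hi, levels, fanout=2):
    """Recursively split [lo, hi] into `fanout` equal-width buckets, `levels` deep."""
    node = {"range": (lo, hi), "count": len(values), "children": []}
    if levels == 0:
        return node  # pre-specified number of concept levels reached
    width = (hi - lo) / fanout
    for i in range(fanout):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        last = i == fanout - 1
        # Buckets are half-open [b_lo, b_hi); the last one also includes b_hi.
        inside = [v for v in values if b_lo <= v < b_hi or (last and v == b_hi)]
        node["children"].append(build_hierarchy(inside, b_lo, b_hi, levels - 1, fanout))
    return node

def leaf_ranges(node):
    """Collect the ranges at the lowest level of the hierarchy."""
    if not node["children"]:
        return [node["range"]]
    return [r for child in node["children"] for r in leaf_ranges(child)]

prices = [5, 12, 30, 45, 55, 70, 88, 100]  # assumed sample data
tree = build_hierarchy(prices, 0, 100, levels=2)
print(leaf_ranges(tree))
```

Each level of the returned tree is one level of the concept hierarchy: the root covers the whole range, and every further level halves the bucket width.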

3] Cluster Analysis

  • Cluster analysis is a popular data discretization method.
  • A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
  • Clustering considers the distribution of A, as well as the closeness of data points, and therefore can produce high-quality discretization results.
  • Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
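
The text above does not name a specific clustering algorithm. As one possible illustration, the sketch below uses a simple one-dimensional k-means with a deterministic starting point, then places cut points midway between neighboring clusters. The function name, the initialization rule, and the sample `ages` are assumptions.

```python
def kmeans_1d(values, k, iterations=20):
    """Group 1-D values into k clusters; returns (clusters, centroids)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start (an assumption): pick k values spread evenly through the sorted data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return clusters, centroids

ages = [1, 2, 3, 20, 21, 22, 60, 61, 65]  # assumed sample data
clusters, centroids = kmeans_1d(ages, 3)
# Interval boundaries: midpoints between the edges of neighboring clusters.
cuts = [(clusters[i][-1] + clusters[i + 1][0]) / 2 for i in range(len(clusters) - 1)]
print(clusters)
print(cuts)
```

Unlike equal-width binning, the boundaries here follow the gaps in the data. Running the same function again on the values of a single cluster would produce the subclusters of the next lower level.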

4] Entropy-Based Discretization

  • Entropy-based discretization is a supervised, top-down splitting technique.
  • It explores class distribution information in its calculation and determination of split points.
  • Let D consist of data instances defined by a set of attributes and a class-label attribute.
  • The class-label attribute provides the class information per instance.
  • Because the split points are chosen using class information, the resulting interval boundaries may help to improve classification accuracy.
  • The entropy and information gain measures used here are the same ones used for decision tree induction.
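
A single step of this top-down procedure can be sketched as follows: every midpoint between two distinct neighboring values is tried as a split point, and the one with the lowest weighted class entropy (equivalently, the highest information gain) is kept. The function names and the toy data are assumptions. A full implementation would recurse on both halves and needs a stopping criterion, which is omitted here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, weighted_entropy) for the best binary split of one attribute."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_info = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best_info:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_info = info
    return best_point, best_info

values = [1, 2, 3, 10, 11, 12]            # assumed sample data
labels = ["a", "a", "a", "b", "b", "b"]
point, info = best_split(values, labels)
print(point, info)
```

Here the split at 6.5 separates the two classes completely, so the weighted entropy after the split is zero and the information gain equals the entropy of the whole set (1 bit).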

5] Interval Merge by χ² Analysis

  • ChiMerge is a χ²-based, bottom-up discretization method.
  • It finds the best neighboring intervals and merges them to form larger intervals, recursively.
  • The method is supervised in that it uses class information.
  • ChiMerge treats intervals as discrete categories.
  • The basic notion is that for accurate discretization, the relative class frequencies should be fairly consistent within an interval.
  • Therefore, if two adjacent intervals have a very similar distribution of classes, then the intervals can be merged.
  • Otherwise, they should remain separate.
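
The merge rule can be sketched as below: each distinct value starts as its own interval, the adjacent pair with the lowest χ² value is merged, and merging stops once every adjacent pair reaches a chosen threshold. The function names and sample data are assumptions. The threshold 2.706 is the χ² critical value for 1 degree of freedom at the 0.10 significance level, picked here only as an example for a two-class problem.

```python
from collections import Counter

def chi2(a, b):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    classes = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            col_total = a.get(c, 0) + b.get(c, 0)
            expected = row_total * col_total / total
            if expected > 0:
                stat += (row.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold):
    """Return the lower bounds of the intervals left after bottom-up merging."""
    intervals = []  # each entry: (lower bound, Counter of class labels)
    for v, lab in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][lab] += 1
        else:
            intervals.append((v, Counter([lab])))
    while len(intervals) > 1:
        scores = [chi2(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] >= threshold:
            break  # every adjacent pair differs enough in class distribution
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lo for lo, _ in intervals]

values = [1, 2, 3, 10, 11, 12]            # assumed sample data
labels = ["a", "a", "a", "b", "b", "b"]
bounds = chimerge(values, labels, threshold=2.706)
print(bounds)
```

A low χ² value means the two intervals have similar class distributions, which is why the lowest-scoring pair is merged first. In this example the values 1 to 3 and 10 to 12 each collapse into one interval, and the two remaining intervals stay separate because their class distributions differ completely.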