What is a concept hierarchy? How is a concept hierarchy generated for numerical and categorical data?

A concept hierarchy reduces the data by collecting low-level concepts (such as numeric values for the attribute age) and replacing them with higher-level concepts (such as young, middle-aged, or senior).

Concept hierarchy generation for numeric data is as follows:

• Binning (see sections before)
• Histogram analysis (see sections before)
• Clustering analysis (see sections before)
• Entropy-based discretization
• Segmentation by natural partitioning

• Binning

• In binning, first sort the data and partition it into (equi-depth) bins; the values can then be smoothed by bin means, bin medians, bin boundaries, etc.
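The binning steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the sample `prices` list are assumptions made for the example, not part of the original answer.

```python
from statistics import mean

def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative data (hypothetical prices).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Each bin can then be treated as one higher-level concept, with the smoothed value standing in for the raw values it contains.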
• Histogram analysis

• A histogram is a popular data reduction technique
• Divide the data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems.
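As a rough illustration of the bucket idea, the sketch below builds an equal-width histogram and keeps only the count, sum, and average per bucket. Equal-width buckets are an assumption made for simplicity (the optimal dynamic-programming construction mentioned above is not shown), and the function name and sample values are hypothetical.

```python
def equi_width_histogram(values, n_buckets):
    """Summarize values as n_buckets equal-width buckets (count, sum, average)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1  # fall back to 1 if all values are equal
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(n_buckets)
    ]
    for v in values:
        # Clamp so the maximum value lands in the last bucket.
        i = min(int((v - lo) / width), n_buckets - 1)
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    for b in buckets:
        b["avg"] = b["sum"] / b["count"] if b["count"] else None
    return buckets

# Illustrative data only.
histogram = equi_width_histogram([1, 2, 3, 11, 12, 28, 30], 3)
```

Only the per-bucket summaries need to be stored, which is where the data reduction comes from.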
• Clustering analysis

• Partition the data set into clusters and store only the cluster representation
• Can be very effective if the data is clustered, but not if the data is “smeared”
• Clustering can be hierarchical, and the clusters can be stored in multi-dimensional index tree structures
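The answer does not name a particular clustering algorithm. As one possible illustration, the sketch below uses a simple one-dimensional k-means (an assumption, chosen only because it is short) and then keeps just a compact representation of each cluster. The function names and the sample `ages` list are hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Very small 1-D k-means; returns (centroids, clusters)."""
    data = sorted(values)
    # Deterministic start: pick k values spread evenly across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
    return centroids, clusters

def cluster_representation(centroids, clusters):
    """Keep only centroid, size, and range for each non-empty cluster."""
    return [
        {"centroid": m, "count": len(c), "min": min(c), "max": max(c)}
        for m, c in zip(centroids, clusters) if c
    ]

# Illustrative data only.
ages = [1, 2, 3, 20, 21, 22, 50, 51, 52]
centroids, clusters = kmeans_1d(ages, 3)
summary = cluster_representation(centroids, clusters)
```

With well-separated data like this, three small records replace nine raw values; with “smeared” data the clusters would overlap and the summary would lose much more information.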
• Entropy-based discretization

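• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is $E(T,S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$, where S1 and S2 correspond to the samples in S satisfying the conditions A < v and A >= v, respectively (v being the value of the boundary T).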

• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

• The process is applied recursively to the resulting partitions until some stopping criterion is met, e.g., a partition is split further only while the information gain satisfies $\mathrm{Ent}(S) - E(T,S) > \delta$
• Experiments show that it may reduce data size and improve classification accuracy
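A minimal sketch of this procedure is shown below, assuming each sample is a `(value, class_label)` pair, that candidate boundaries are midpoints between adjacent distinct values, and that splitting stops once the gain Ent(S) - E(T,S) is no larger than a threshold `delta`. All names, the default `delta`, and the sample data are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(S): class entropy of a list of labels (0 for an empty list)."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def best_boundary(samples):
    """Return (T, E(T,S)) for the boundary with minimum entropy, or None."""
    samples = sorted(samples)
    n = len(samples)
    best = None
    for i in range(1, n):
        if samples[i - 1][0] == samples[i][0]:
            continue  # no boundary between equal values
        t = (samples[i - 1][0] + samples[i][0]) / 2
        left = [label for _, label in samples[:i]]    # A < T
        right = [label for _, label in samples[i:]]   # A >= T
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or e < best[1]:
            best = (t, e)
    return best

def discretize(samples, delta=0.1):
    """Recursively split while Ent(S) - E(T,S) > delta; return the boundaries."""
    found = best_boundary(samples)
    if found is None:
        return []
    t, e = found
    gain = entropy([label for _, label in samples]) - e
    if gain <= delta:
        return []
    left = [s for s in samples if s[0] < t]
    right = [s for s in samples if s[0] >= t]
    return discretize(left, delta) + [t] + discretize(right, delta)

# Illustrative data: (age, class label).
samples = [(20, "no"), (22, "no"), (25, "no"), (40, "yes"), (45, "yes"), (50, "yes")]
boundaries = discretize(samples)
```

The returned boundaries define the intervals of the discretized attribute; here a single cut separates the two classes, so no further splitting is worthwhile.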
• Segmentation by natural partitioning

• The 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
• If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6, and 9; grouped 2-3-2 for 7)
• If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals
• If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals
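One level of the rule can be sketched as follows. The sketch assumes the interval ends have already been rounded to multiples of the most-significant-digit unit `msd`, and it splits the 7-value case as 2-3-2, which is how the rule is usually stated. The function name and the example figures are hypothetical.

```python
def three_four_five(low, high, msd):
    """Apply one level of the 3-4-5 rule.

    low, high: interval ends, assumed already rounded to multiples of msd.
    msd: unit of the most significant digit (e.g. 1_000_000).
    Returns a list of (start, end) intervals.
    """
    distinct = round((high - low) / msd)
    if distinct == 7:
        parts = [2, 3, 2]               # 2-3-2 grouping, in msd units
    elif distinct in (3, 6, 9):
        parts = [distinct / 3] * 3      # 3 equi-width intervals
    elif distinct in (2, 4, 8):
        parts = [distinct / 4] * 4      # 4 equi-width intervals
    elif distinct in (1, 5, 10):
        parts = [distinct / 5] * 5      # 5 equi-width intervals
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")
    intervals, start = [], low
    for p in parts:
        end = start + p * msd
        intervals.append((start, end))
        start = end
    return intervals

# Illustrative range: -1,000,000 to 2,000,000 covers 3 values at the millions digit.
top_level = three_four_five(-1_000_000, 2_000_000, 1_000_000)
```

Applying the same function again to each resulting interval, with a smaller `msd`, produces the next level of the hierarchy.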

Concept hierarchy generation for categorical data is as follows:

• Specification of a set of attributes, but not of their partial ordering

• Automatically generate the attribute ordering, based on the observation that an attribute defining a high-level concept has a smaller number of distinct values than an attribute defining a lower-level concept
• Example: country (15), state_or_province (365), city (3567), street (674,339)
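The ordering heuristic amounts to sorting the attributes by their number of distinct values. A minimal sketch, using the counts from the example above (the function names and the tiny `rows` table are assumptions for illustration):

```python
def count_distinct(rows, attributes):
    """Count distinct values per attribute in a list of dict-like rows."""
    return {a: len({row[a] for row in rows}) for a in attributes}

def order_attributes(distinct_counts):
    """Order attributes from highest to lowest concept level.

    Fewer distinct values is taken to mean a higher-level concept.
    """
    return sorted(distinct_counts, key=distinct_counts.get)

# Distinct-value counts taken from the example in the text.
counts = {"street": 674_339, "country": 15, "city": 3567, "state_or_province": 365}
hierarchy = order_attributes(counts)

# The same idea starting from raw rows (hypothetical data).
rows = [
    {"country": "Canada", "city": "Vancouver"},
    {"country": "Canada", "city": "Toronto"},
    {"country": "USA", "city": "Chicago"},
]
small_hierarchy = order_attributes(count_distinct(rows, ["city", "country"]))
```

The heuristic can fail when a higher-level attribute happens to have more distinct values than a lower-level one, so the generated ordering is best treated as a suggestion to review.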
• Specification of only a partial set of attributes

• Try to parse the database schema to determine the complete hierarchy