Explain data discretization and summarization

written 8.1 years ago by

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace the actual data values.
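As a minimal sketch of replacing values with interval labels, the snippet below assumes equal-width intervals, an unsupervised scheme chosen only for illustration. The function name `equal_width_labels` and the sample ages are hypothetical, not part of the original answer.

```python
def equal_width_labels(values, k):
    """Replace each numeric value with the label of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        # Index of the interval containing v; the maximum falls in the last one.
        idx = min(int((v - lo) / width), k - 1) if width > 0 else 0
        left = lo + idx * width
        right = lo + (idx + 1) * width
        # Only the last interval is closed on the right.
        closing = "]" if idx == k - 1 else ")"
        labels.append(f"[{left:g}, {right:g}{closing}")
    return labels


ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]  # made-up sample data
age_labels = equal_width_labels(ages, 3)
```

After this step the nine distinct ages are represented by only three interval labels, which is the reduction in values described above.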

Entropy-Based Discretization

• Entropy is one of the most commonly used discretization measures.

• Entropy-based discretization is a supervised, top-down splitting technique.

• It explores class distribution information in its calculation and determination of split-points (data values for partitioning an attribute range).

• Entropy-based discretization can reduce data size.

• Unlike unsupervised discretization methods such as binning, entropy-based discretization uses class information.

• This makes it more likely that the interval boundaries (split-points) are defined to occur in places that may help improve classification accuracy.

• The entropy and information gain measures described here are also used for decision tree induction.

• The expected information requirement for classifying a tuple is given by:

[Formula image not available]
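As a hedged reconstruction, the standard textbook form of this quantity for a binary split of a data set D into D1 and D2 at a split-point of attribute A is Info_A(D) = (|D1|/|D|) Entropy(D1) + (|D2|/|D|) Entropy(D2), with Entropy(D) = -sum over classes of p_i log2(p_i). The sketch below assumes that form and performs one step of the top-down procedure: it tries the midpoint between each pair of adjacent distinct values and keeps the one with the smallest Info_A(D). The function names and the sample data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, info) minimizing the weighted entropy of a binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between equal values
        split = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[1]:
            best = (split, info)
    return best


# Made-up data: attribute values with their class labels
values = [1, 2, 3, 10, 11, 12]
classes = ["a", "a", "a", "b", "b", "b"]
split_point, info = best_split(values, classes)
```

A full discretizer would apply `best_split` recursively to each resulting partition until some stopping criterion is met; that criterion is left out here because the original answer does not specify one.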

DATA SUMMARIZATION

• Summarization is a key data mining concept which involves techniques for finding a compact description of a dataset.

• Simple summarization methods, such as tabulating means and standard deviations, are often applied for data analysis, data visualization and automated report generation.

• Clustering [13, 23] is another data mining technique that is often used to summarize large datasets.

• Summarization can be viewed as compressing a given set of transactions into a smaller set of patterns while retaining the maximum possible information.

• A trivial summary for a set of transactions would be itself.

• The information loss here is zero but there is no compaction.

• Another trivial summary would be the empty set, which represents all the transactions.

• In this case the gain in compaction is maximum but the summary has no information content.

• A good summary is one which is small but still retains enough information about the data as a whole and also for each transaction.
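The simplest kind of summary mentioned above, tabulating the mean and standard deviation of each attribute, can be sketched as follows. The column names and values are made up for illustration, and the sample (n - 1) standard deviation is an assumption, since the answer does not say which variant is meant.

```python
from statistics import mean, stdev


def tabulate_summary(columns):
    """Map each column name to its (mean, sample standard deviation)."""
    return {name: (mean(vals), stdev(vals)) for name, vals in columns.items()}


# Hypothetical numeric data set with two attributes
data = {"height_cm": [150, 160, 170], "weight_kg": [50, 60, 70]}
summary = tabulate_summary(data)
```

Each column of any length is reduced to two numbers, so the summary is very compact but keeps no information about individual records.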

Summarization Using Clustering

o Here we present a direct application of clustering to obtain a summary for a given set of transactions with categorical attributes.

o This simple algorithm involves clustering of the data using any standard clustering algorithm and then replacing each cluster with a representation using feature-wise intersection of all transactions in that cluster.

o The number of clusters here determines the compaction gain for the summary.

o Step 2 generates l clusters, while steps 3 and 4 generate the summary description for each of the individual clusters.

o For example, consider the sample data set of 8 transactions in the table below.

[Table image not available: sample data set of 8 transactions]

o Let clustering generate two clusters for this data set:

(C1 = {T1, T2, T3, T4, T8} and C2 = {T5, T6, T7})

o The table below shows a summary obtained using the clustering-based algorithm.

[Table image not available: summary obtained using the clustering-based algorithm]
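The clustering-based procedure described above can be sketched as follows. The clustering step itself is not implemented: the cluster assignment is taken as given, reusing C1 and C2 from the example. The transaction values are hypothetical and are not the contents of the original table. Marking a disagreeing feature with "*" and computing compaction gain as the number of transactions divided by the number of summary rows are both assumptions made for this illustration.

```python
def summarize_cluster(transactions):
    """Feature-wise intersection: keep a value only if every transaction shares it."""
    first, *rest = transactions
    return tuple(
        value if all(t[i] == value for t in rest) else "*"
        for i, value in enumerate(first)
    )


def summarize(transactions, clusters):
    """Replace each cluster (a list of transaction ids) with one summary row."""
    return {
        name: summarize_cluster([transactions[t] for t in ids])
        for name, ids in clusters.items()
    }


# Hypothetical categorical transactions: (protocol, service, status)
transactions = {
    "T1": ("tcp", "http", "ok"),
    "T2": ("tcp", "http", "ok"),
    "T3": ("tcp", "http", "fail"),
    "T4": ("tcp", "http", "ok"),
    "T5": ("udp", "dns", "ok"),
    "T6": ("udp", "ntp", "ok"),
    "T7": ("udp", "dns", "ok"),
    "T8": ("tcp", "http", "fail"),
}

# Cluster assignment assumed to come from any standard clustering algorithm
clusters = {"C1": ["T1", "T2", "T3", "T4", "T8"], "C2": ["T5", "T6", "T7"]}

summary = summarize(transactions, clusters)
compaction_gain = len(transactions) / len(summary)  # assumed definition
```

With fewer clusters the summary is smaller but more features collapse to "*", which is the trade-off between compaction and information loss discussed earlier.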
