Explain various attribute selection measures

written 2.3 years ago by binitamayekar • modified 2.3 years ago

Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.
It determines how the tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing the given training tuples.
Three measures are commonly used for attribute selection:
- Information Gain
- Gain Ratio
- Gini Index

Information gain is used to select the splitting attribute at each node of the decision tree.
It is based on entropy: each split aims to reduce the entropy of the partitions, working from the root node down to the leaf nodes.
The attribute with the highest information gain is chosen as the splitting attribute for the current node.
It is biased toward multi-valued attributes, that is, attributes with many distinct values.
The information gain on attribute A is the mutual information between the class attribute, Class, and attribute A.
It is defined as follows:

$$ Information\ Gain\ (A) = H(Class) - H(Class | A) $$
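
As a rough illustration of the formula above, the following sketch computes H(Class) and H(Class | A) from two aligned lists. The function names (`entropy`, `information_gain`) and the toy data are assumptions made for this example and are not part of the original answer.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(Class) - H(Class | A), where values[i] is A's value for tuple i."""
    n = len(labels)
    # Group the class labels by the value that attribute A takes.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # H(Class | A): entropy of each group, weighted by the group's share.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: "outlook" separates the classes perfectly,
# while "windy" carries no information about the class.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["t", "f", "t", "f"]

print(information_gain(outlook, play))  # 1.0
print(information_gain(windy, play))    # 0.0
```

In this toy case the attribute that splits the classes cleanly receives the full 1 bit of gain, and the uninformative attribute receives none.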

The gain ratio tends to prefer unbalanced splits, in which one partition is much smaller than the others.
The gain ratio on attribute A is the ratio of the information gain on A to the expected information of A, H(A), which normalizes the gain across attributes with different numbers of values.
It is defined as follows:

$$ Gain\ Ratio\ (A) = \frac {H(Class) - H(Class | A)}{ H(A)} $$
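
The sketch below shows one way the gain ratio could be computed, and how dividing by H(A) penalizes an attribute with many distinct values. The function names and toy data are assumptions for illustration. Returning 0.0 when H(A) is zero (an attribute with a single value) is also an assumption of this sketch, chosen to avoid division by zero.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy, in bits, of a sequence of discrete items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """(H(Class) - H(Class | A)) / H(A) for attribute values aligned with labels."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(values)  # H(A): entropy of the attribute's own values
    # Assumption: a single-valued attribute cannot split the data, so return 0.0.
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: both attributes have an information gain of 1 bit,
# but the ID-like attribute has a larger H(A) and so a lower gain ratio.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
record_id = ["r1", "r2", "r3", "r4"]

print(gain_ratio(outlook, play))    # 1.0
print(gain_ratio(record_id, play))  # 0.5
```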

The Gini index considers a binary split for each attribute.
It tends to favor splits that produce partitions of equal size.
When the measure is the weighted impurity after the split, the attribute with the minimum value is selected; with the reduction-of-impurity form defined below, this corresponds to the attribute with the largest reduction.
It is also biased toward multi-valued attributes.
It has difficulty when the number of classes is large.
The Gini function measures the impurity of a data partition with respect to the classes.
The impurity function is defined as:

$$ Gini\ (Class) = 1 - \sum_{i} p_i^2 $$

Here p_i is the proportion of tuples in the partition that belong to class i.
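
A minimal sketch of the impurity function, assuming the class labels of a partition are given as a list. The function name `gini` is chosen for this example only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


print(gini(["yes", "yes", "no", "no"]))    # 0.5 (two evenly mixed classes)
print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 (pure partition)
```

A pure partition has impurity 0, and the impurity grows as the classes become more evenly mixed.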

The Gini index of A, defined below, is the difference between the impurity of Class and the weighted average impurity of the partitions induced by A, where c_j ranges over the values of A; it represents the reduction in impurity obtained by choosing attribute A.
The Gini index is defined as follows:

$$ Gini\ Index\ (A) = Gini\ (Class) - \sum_{j = 0}^m P(c_j )\ Gini\ (A = c_j ) $$
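
The reduction-of-impurity formula can be sketched as follows. This follows the formula as written, with one partition per value of A rather than a binary grouping of values. The function names and toy data are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(values, labels):
    """Gini(Class) minus the weighted Gini impurity of each partition A = c_j."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weight each partition's impurity by P(A = c_j), its share of the tuples.
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted


# Hypothetical toy data: "outlook" removes all impurity, "windy" removes none.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["t", "f", "t", "f"]

print(gini_index(outlook, play))  # 0.5
print(gini_index(windy, play))    # 0.0
```

Under this definition a larger value means a larger reduction in impurity, so in the toy data `outlook` would be preferred over `windy`.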
