Write short notes on: Outlier analysis.
• A database may contain data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
• Most data mining methods discard outliers as noise or exceptions.
• However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
• The analysis of outlier data is referred to as outlier mining.
• Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
• Rather than using statistical or distance measures, deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group.
• Outliers can be caused by measurement or execution error.
• Outliers may be the result of inherent data variability.
• Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether.
• This, however, could result in the loss of important hidden information because one person’s noise could be another person’s signal.
• Thus, outlier detection and analysis is an interesting data mining task in its own right.

There are four computer-based approaches to outlier detection.

• The statistical approach:

• This approach assumes a distribution for the given data set and then identifies outliers with respect to the model using a discordancy test.
• A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis.
• A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H: o_i ∈ F, where i = 1, 2, ..., n.

• The hypothesis is retained if there is no statistically significant evidence supporting its rejection.

• A discordancy test verifies whether an object, o_i, is significantly large (or small) in relation to the distribution F.
• Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object o_i is v_i, then the distribution of T is constructed.
• The significance probability, SP(v_i) = Prob(T > v_i), is evaluated.
• If SP(v_i) is sufficiently small, then o_i is discordant and the working hypothesis is rejected.
• An alternative hypothesis, H̄, which states that o_i comes from another distribution model, G, is adopted.
• The result is very much dependent on which model is chosen because oi may be an outlier under one model and a perfectly valid value under another.
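As an illustration, the discordancy test can be sketched in a few lines of Python. This sketch assumes (purely for illustration) that the distribution F is normal and that the statistic T is the absolute z-score of each observation; both choices are assumptions, not the only options.

```python
import math

def discordancy_test(data, alpha=0.05):
    """Flag values whose significance probability SP(v_i) = Prob(T > v_i)
    falls below alpha, under the working hypothesis H that the data come
    from a normal distribution F (an assumption made for this sketch).
    The statistic T is the absolute z-score of each observation."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    outliers = []
    for x in data:
        t = abs(x - mean) / std           # observed value v_i of statistic T
        sp = math.erfc(t / math.sqrt(2))  # two-sided normal tail probability
        if sp < alpha:                    # SP(v_i) sufficiently small:
            outliers.append(x)            # reject H for this object
    return outliers

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 25.0]
print(discordancy_test(data))  # → [25.0]
```

Note how the choice of model matters: under a heavier-tailed distribution the same value might not be discordant.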
• The distance-based approach:
• This approach generalizes the ideas behind discordancy testing for various standard distributions: an object o in a data set D is a distance-based outlier, DB(pct, dmin), if at least a fraction pct of the objects in D lie at a distance greater than dmin from o.
• Several efficient algorithms for mining distance-based outliers have been developed.
1. Index-based algorithm:

• Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object.
• Let M be the maximum number of objects within the dmin-neighborhood of an outlier.
• Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier.
• This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality.
• The index-based algorithm scales well as k increases.
2. Nested-loop algorithm:

• The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os.
• It divides the memory buffer space into two halves and the data set into several logical blocks.
• By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
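The core test shared by the index-based and nested-loop algorithms can be sketched as a simple memory-resident double loop (omitting the block-based I/O scheduling that the full nested-loop algorithm adds). The function name, parameters, and example points below are illustrative, not part of any standard API:

```python
import math

def db_outliers(points, pct, dmin):
    """Naive nested-loop detection of distance-based DB(pct, dmin) outliers:
    an object is an outlier if at least fraction pct of the objects lie at
    a distance greater than dmin from it."""
    n = len(points)
    # M = maximum allowed number of dmin-neighbors for an outlier
    m = int(n * (1 - pct))
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue
            if math.dist(p, q) <= dmin:
                neighbors += 1
                if neighbors > m:   # M+1 neighbors found: not an outlier
                    break
        else:
            outliers.append(p)
    return outliers

pts = [(0, 0), (0.5, 0), (0, 0.5), (0.4, 0.4), (10, 10)]
print(db_outliers(pts, pct=0.7, dmin=1.0))  # → [(10, 10)]
```

The early `break` once M + 1 neighbors are found is the same pruning rule that makes the index-based algorithm efficient.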
3. Cell-based algorithm:

• To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets.
• Its complexity is O(cᵏ + n), where c is a constant depending on the number of cells and k is the dimensionality.
• In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k).
• Each cell has two layers surrounding it.
• The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick.
• The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis.
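The cell geometry above follows directly from the two formulas; a small sketch (the inputs dmin = 1.0 and k = 2 are example values, not from the source):

```python
import math

def cell_geometry(dmin, k):
    """Cell parameters for the cell-based algorithm: cells have side length
    dmin / (2 * sqrt(k)); the first surrounding layer is 1 cell thick, and
    the second layer is ceil(2 * sqrt(k) - 1) cells thick."""
    side = dmin / (2 * math.sqrt(k))
    layer2 = math.ceil(2 * math.sqrt(k) - 1)
    return side, layer2

side, layer2 = cell_geometry(dmin=1.0, k=2)
print(round(side, 4))  # → 0.3536
print(layer2)          # → 2
```

These thicknesses guarantee that any point in the first layer is within dmin of the cell, and any point beyond the second layer is farther than dmin from it.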

• The density-based local outlier approach:

• Distance-based outlier detection faces difficulty in identifying outliers if the data is not uniformly distributed, because it depends on the overall distribution of the given set of data points.
• The density-based approach addresses this by judging each object relative to the density of its local neighborhood.
• E.g., a figure shows a simple 2-D data set containing 502 objects, with two obvious clusters.
• Cluster C1 contains 400 objects, while cluster C2 contains 100 objects.
• Two additional objects, o1 and o2, are clearly outliers.

• However, by distance-based outlier detection, only o1 is a reasonable outlier.
• This brings us to the notion of local outliers.
• An object is a local outlier if it is outlying relative to its local neighborhood, particularly with respect to the density of the neighborhood.
• In this view, o2 is a local outlier relative to the density of C2.
• Object o1 is an outlier as well, and no objects in C1 are mislabeled as outliers.
• This forms the basis of density-based local outlier detection.
• Therefore, unlike the previous methods, it does not treat being an outlier as a binary property; instead, it assesses the degree to which an object is an outlier.
• This degree of “outlierness” is computed as the local outlier factor (LOF) of an object.
• This degree of “outlierness” depends on how isolated the object is with respect to the surrounding neighborhood.
• This approach can detect both global and local outliers.
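A simplified, from-scratch sketch of the LOF computation is shown below. It follows the standard definitions (k-distance, reachability distance, local reachability density) but ignores ties in k-distance; the function names and the example points are illustrative:

```python
import math

def lof_scores(points, k=2):
    """Simplified local outlier factor (LOF): a score near 1 means the
    object's density matches its neighbors'; a score well above 1 marks
    a local outlier. Ties in k-distance are ignored in this sketch."""
    def knn(i):
        # k nearest neighbors of point i, as (distance, index) pairs
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(len(points)) if j != i)
        return dists[:k]

    neighbors = [knn(i) for i in range(len(points))]
    k_dist = [nb[-1][0] for nb in neighbors]  # distance to k-th neighbor

    def lrd(i):
        # local reachability density: inverse of mean reachability distance
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(len(points))]
    # LOF = average ratio of neighbors' density to the object's own density
    return [sum(lrds[j] for _, j in neighbors[i]) / (k * lrds[i])
            for i in range(len(points))]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(pts, k=2)
print(scores[-1] > 2)         # → True: (5, 5) is strongly outlying
print(max(scores[:4]) < 1.5)  # → True: cluster members score near 1
```

Because the score is a ratio of densities, the same threshold flags o2-style local outliers in a dense cluster and o1-style global outliers alike.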
• The deviation-based approach:
• This approach identifies outliers by examining the main characteristics of objects in a group.
• Objects that “deviate” from this description are considered outliers.
• Hence, in this approach the term deviation is typically used to refer to outliers.
• There are two techniques for deviation-based outlier detection.
• The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.
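A toy sketch of the first (sequential) technique follows, using variance as the dissimilarity function. Treating the exception as the object whose removal most reduces the set's dissimilarity is one simple reading of the idea; the function name and data are illustrative:

```python
def sequential_exception(values):
    """Deviation-based sketch: using variance as the dissimilarity
    function, report the object whose removal reduces the dissimilarity
    of the set the most (the "exception")."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    total = variance(values)
    best, best_reduction = None, 0.0
    for i, v in enumerate(values):
        rest = values[:i] + values[i + 1:]
        reduction = total - variance(rest)
        if reduction > best_reduction:
            best, best_reduction = v, reduction
    return best

print(sequential_exception([3, 3, 4, 3, 100]))  # → 100
```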