- A database may contain data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
- Most data mining methods discard outliers as noise or exceptions.
- However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
- The analysis of outlier data is referred to as outlier mining.
- Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
- Rather than using statistical or distance measures, deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group.
- Outliers can be caused by measurement or execution error.
- Outliers may be the result of inherent data variability.
- Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether.
- This, however, could result in the loss of important hidden information because one person’s noise could be another person’s signal.
- Thus, outlier detection and analysis is an interesting data mining task in its own right.
There are four computer-based approaches to outlier detection: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach.
The statistical approach:
- This approach assumes a distribution for the given data set and then identifies outliers with respect to the model using a discordancy test.
- A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis.
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H: o_i ∈ F, where i = 1, 2, ..., n.
The hypothesis is retained if there is no statistically significant evidence supporting its rejection.
- A discordancy test verifies whether an object, o_i, is significantly large (or small) in relation to the distribution F.
- Assuming that some statistic, T, has been chosen for discordancy testing, and that the value of the statistic for object o_i is v_i, the distribution of T is constructed.
- The significance probability, SP(v_i) = Prob(T > v_i), is evaluated.
- If SP(v_i) is sufficiently small, then o_i is discordant and the working hypothesis is rejected.
- An alternative hypothesis, which states that o_i comes from another distribution model, G, is adopted.
- The result is very much dependent on which model is chosen, because o_i may be an outlier under one model and a perfectly valid value under another.
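The discordancy test above can be sketched as follows, under the simplifying assumption that the working distribution F is normal; the alpha threshold and the use of the normal tail probability for SP are illustrative choices, not prescribed by the general method:

```python
import math
import statistics

def discordancy_test(data, alpha=0.05):
    """Flag values that are discordant under the working hypothesis
    that the data come from a normal distribution (an assumption
    made for this sketch).

    The test statistic T for each object o_i is its standardized
    deviation v_i = |o_i - mean| / stdev.  The significance
    probability SP(v_i) = Prob(T > v_i) is taken from the two-sided
    normal tail; if SP(v_i) < alpha, o_i is discordant and the
    working hypothesis is rejected for it.
    """
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    outliers = []
    for x in data:
        v = abs(x - mu) / sigma
        sp = math.erfc(v / math.sqrt(2))  # P(|Z| > v) for standard normal Z
        if sp < alpha:
            outliers.append(x)
    return outliers
```

As the notes caution, the verdict depends entirely on the model assumed: a value discordant under a normal model may be perfectly plausible under a heavier-tailed one.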
The distance-based approach:
- This approach generalizes the ideas behind discordancy testing for various standard distributions: an object's neighbors are defined based on their distance from that object, and an object with too few neighbors within a given distance is regarded as an outlier.
- Several efficient algorithms for mining distance-based outliers have been developed.
- Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object.
- Let M be the maximum number of objects within the dmin-neighborhood of an outlier.
- Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier.
- This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality.
- The index-based algorithm scales well as k increases.
- The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os.
- It divides the memory buffer space into two halves and the data set into several logical blocks.
- By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
- To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets.
- Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.
- In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k).
- Each cell has two layers surrounding it.
- The first layer is one cell thick, while the second is 2√k − 1 cells thick, rounded up to the closest integer.
The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis.
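A minimal sketch of distance-based outlier detection in the nested-loop style follows. It assumes the DB(pct, dmin) formulation implied above (function and parameter names are illustrative, and a real implementation would block the data set to reduce I/O, as the notes describe); note the early exit once M + 1 neighbors are found:

```python
import math

def db_outliers(points, pct, dmin):
    """Sketch of DB(pct, dmin)-outlier detection by nested loops.

    An object is a distance-based outlier if at least a fraction pct
    of the objects lie farther than dmin from it.  Equivalently, it
    has at most M = n * (1 - pct) neighbors within dmin, so the inner
    loop can stop as soon as M + 1 neighbors are found.
    """
    n = len(points)
    m = n * (1 - pct)  # max number of dmin-neighbors an outlier may have
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue
            if math.dist(p, q) <= dmin:
                neighbors += 1
                if neighbors > m:  # M + 1 neighbors found: p is not an outlier
                    break
        else:  # inner loop finished without the early exit
            outliers.append(p)
    return outliers
```

For example, in a data set of four tightly clustered points plus one distant point, only the distant point survives the neighbor count and is reported.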
The density-based local outlier approach:
- Distance-based outlier detection faces difficulty in identifying outliers when the data are not uniformly distributed.
- Therefore, this approach is used: it considers the density of an object's local neighborhood rather than a single global view of the data points.
- E.g., the figure shows a simple 2-D data set containing 502 objects, with two obvious clusters.
- Cluster C1 contains 400 objects, while cluster C2 contains 100 objects.
- Two additional objects, o1 and o2, are clearly outliers.
- However, by distance-based outlier detection, only o1 is flagged as a reasonable outlier.
- This brings us to the notion of local outliers.
- An object is a local outlier if it is outlying relative to its local neighborhood, particularly with respect to the density of the neighborhood.
- In this view, o2 is a local outlier relative to the density of C2.
- Object o1 is an outlier as well, and no objects in C1 are mislabeled as outliers.
- This forms the basis of density-based local outlier detection.
- Therefore unlike previous methods, it does not consider being an outlier as a binary property. Instead, it assesses the degree to which an object is an outlier.
- This degree of “outlierness” is computed as the local outlier factor (LOF) of an object, and it depends on how isolated the object is with respect to its surrounding neighborhood.
- This approach can detect both global and local outliers.
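The LOF computation can be sketched roughly as follows. This is a simplified version (the tiny k and the helper names are illustrative, and ties between duplicate points are not handled): a point's local reachability density is compared with that of its k nearest neighbors, so a score near 1 means similar density and a score well above 1 marks a local outlier such as o2:

```python
import math

def lof(points, k=2):
    """Simplified local outlier factor: returns one score per point.
    Scores near 1 indicate a point in a region of comparable density;
    scores well above 1 indicate a local outlier."""
    def knn(i):
        # k nearest neighbors of point i as (distance, index) pairs
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(len(points)) if j != i)
        return dists[:k]

    def k_distance(i):
        return knn(i)[-1][0]

    def lrd(i):
        # local reachability density: inverse of mean reachability distance
        reach = [max(k_distance(j), d) for d, j in knn(i)]
        return k / sum(reach)

    return [sum(lrd(j) for _, j in knn(i)) / (k * lrd(i))
            for i in range(len(points))]
```

In a toy data set of one tight cluster plus a far-away point, the cluster members score close to 1 while the isolated point scores several times higher.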
The deviation-based approach:
- This approach identifies outliers by examining the main characteristics of objects in a group.
- Objects that “deviate” from this description are considered outliers.
- Hence, in this approach the term deviation is typically used to refer to outliers.
- There are two techniques for deviation-based outlier detection.
- The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.
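The first, sequential technique can be sketched as follows, using variance as the dissimilarity function over the objects seen so far; the jump-factor threshold and the sorting of the input are illustrative assumptions of this sketch, not part of the original method:

```python
import statistics

def sequential_exceptions(values):
    """Sketch of sequential deviation-based detection: scan the
    objects, and flag any object whose arrival sharply increases the
    dissimilarity (here, variance) of the set seen so far.

    The factor-of-4 jump threshold is an arbitrary illustrative
    choice; the real technique searches for the exception set whose
    removal most reduces dissimilarity.
    """
    seen, exceptions = [], []
    for x in sorted(values):  # scan order matters in the real method
        if len(seen) >= 2:
            before = statistics.pvariance(seen)
            after = statistics.pvariance(seen + [x])
            if before > 0 and after / before > 4:  # dissimilarity jump
                exceptions.append(x)
                continue
        seen.append(x)
    return exceptions
```

An object that "deviates" from the running description of the group, such as 100 among small single-digit values, causes a large variance jump and is reported as an exception.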