Write short notes on: Outlier analysis.

235views

written 10.2 years ago by

priyadarshitathagat • 0

A database may contain data objects that do not comply with the general behavior or
model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
Rather than using statistical or distance measures, deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group.
Outliers can be caused by measurement or execution error.
Outliers may be the result of inherent data variability.
Many data mining algorithms try to minimize the influence of outliers or eliminate them all together.
This, however, could result in the loss of important hidden information because one person’s noise could be another person’s signal.
Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.

There are four approaches to computer-based methods for outlier detection.

The statistical approach:
- This approach assumes a distribution for the given data set and then identifies outliers with respect to the model using a discordancy test.
- A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis.
- A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is H:o_i ϵ F, where i= 1,2,....,n.
- The hypothesis is retained if there is no statistically significant evidence supporting its rejection.
- A discordancy test verifies whether an object, oi , is significantly large (or small) in relation to the distribution F.
- Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object oiisvi , then the distribution of T is constructed.
- Significance probability, SP(vi)=Prob(T >vi), is evaluated.
- If SP(vi ) is sufficiently small, then oi is discordant and the working hypothesis is rejected.
- An alternative hypothesis, H, which states that oi comes from another distribution model (G), is adopted.
- The result is very much dependent on which model is chosen because oi may be an outlier under one model and a perfectly valid value under another.
The distance-based approach:
- This approach generalizes the ideas behind discordancy testing for various standard distributions and its neighbors are defined based on their distance from the given object.
- Several efficient algorithms for mining distance-based outliers have been developed.

Index-based algorithm:
- Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object ‘o’within radius ‘dmin’ around that object.
- Let Mbe the maximum number of objects within the dmin-neighborhood of an outlier.
- Therefore, onceM+1 neighbors of object o are found, it is clear that o is not an outlier.
- This algorithm has a worst-case complexity of O(n2k), where n is the number of objects in the data set and k is the dimensionality.
- The index-based algorithm scales well as k increases.
Nested-loop algorithm:
- The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os.
- It divides the memory buffer space into two halves and the data set into several logical blocks.
- By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
Cell-based algorithm:
- To avoidO(n2) computational complexity, a cell-based algorithm was developed for memory-resident data sets.
- Its complexity is O(ck+n), where c is a constant depending on the number of cells and k is the dimensionality.
- In this method, the data space is partitioned into cells with a side length equal to dmin/2√k.
- Each cell has two layers surrounding it.
- The first layer is one cell thick, while the second is (d2/√k-1)1e cells thick, rounded up to the closest integer.
- The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis.
- The density-based local outlier approach:
- Distance based outlier detection faces difficulty in identifying outliers if data is not uniformly distributed.
- Therefore this approach is used which depends on the overall distribution of the given set of data points
- Eg: Figure shows a simple 2-D data set containing 502 objects, with two obvious clusters.
- Cluster C1 contains 400 objects while Cluster C2 contains 100 objects.
- Two additional objects, o1 and o2 are clearly outliers.

enter image description here

However, by distance-based outlier detection only o1 is a reasonable outlier
- This brings us to the notion of local outliers.
- An object is a local outlier if it is outlyingrelative to its local neighborhood, particularly with respect to the density of the neighborhood.
- In this view, o2 is a local outlier relative to the density of C2.
- Object o1 is an outlier as well, and no objects in C1 are mislabeled as outliers.
- This forms the basis of density-based local outlier detection.
- Therefore unlike previous methods, it does not consider being an outlier as a binary property. Instead, it assesses the degree to which an object is an outlier.
- This degree of “outlierness” is computed as the local outlier factor (LOF) of an object.
- This degree of “outlierness” depends on how isolated the object is with respect to the surrounding neighborhood.
- This approach can detect both global and local outliers.
The deviation-based approach:
- This approach identifies outliers by examining the main characteristics of objects in a group.
- Objects that “deviate” from this description are considered outliers.
- Hence, in this approach the term deviation is typically used to refer to outliers.
- There are two techniques for deviation-based outlier detection.
- The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.

ADD COMMENT EDIT