Write short notes on: Outlier analysis.
1 Answer
  • A database may contain data objects that do not comply with the general behavior or
  • model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
  • Most data mining methods discard outliers as noise or exceptions.
  • However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
  • The analysis of outlier data is referred to as outlier mining.
  • Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
  • Rather than using statistical or distance measures, deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group.
  • Outliers can be caused by measurement or execution error.
  • Outliers may be the result of inherent data variability.
  • Many data mining algorithms try to minimize the influence of outliers or eliminate them all together.
  • This, however, could result in the loss of important hidden information because one person’s noise could be another person’s signal.
  • Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.

There are four approaches to computer-based methods for outlier detection.

  • The statistical approach:

    • This approach assumes a distribution for the given data set and then identifies outliers with respect to the model using a discordancy test.
    • A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis.
    • A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is H:o_i ϵ F, where i= 1,2,....,n.

    • The hypothesis is retained if there is no statistically significant evidence supporting its rejection.

    • A discordancy test verifies whether an object, oi , is significantly large (or small) in relation to the distribution F.
    • Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object oiisvi , then the distribution of T is constructed.
    • Significance probability, SP(vi)=Prob(T >vi), is evaluated.
    • If SP(vi ) is sufficiently small, then oi is discordant and the working hypothesis is rejected.
    • An alternative hypothesis, H, which states that oi comes from another distribution model (G), is adopted.
    • The result is very much dependent on which model is chosen because oi may be an outlier under one model and a perfectly valid value under another.
  • The distance-based approach:
    • This approach generalizes the ideas behind discordancy testing for various standard distributions and its neighbors are defined based on their distance from the given object.
    • Several efficient algorithms for mining distance-based outliers have been developed.
  1. Index-based algorithm:

    • Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object ‘o’within radius ‘dmin’ around that object.
    • Let Mbe the maximum number of objects within the dmin-neighborhood of an outlier.
    • Therefore, onceM+1 neighbors of object o are found, it is clear that o is not an outlier.
    • This algorithm has a worst-case complexity of O(n2k), where n is the number of objects in the data set and k is the dimensionality.
    • The index-based algorithm scales well as k increases.
  2. Nested-loop algorithm:

    • The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os.
    • It divides the memory buffer space into two halves and the data set into several logical blocks.
    • By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
  3. Cell-based algorithm:

    • To avoidO(n2) computational complexity, a cell-based algorithm was developed for memory-resident data sets.
    • Its complexity is O(ck+n), where c is a constant depending on the number of cells and k is the dimensionality.
    • In this method, the data space is partitioned into cells with a side length equal to dmin/2√k.
    • Each cell has two layers surrounding it.
    • The first layer is one cell thick, while the second is (d2/√k-1)1e cells thick, rounded up to the closest integer.
    • The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis.

    • The density-based local outlier approach:

    • Distance based outlier detection faces difficulty in identifying outliers if data is not uniformly distributed.
    • Therefore this approach is used which depends on the overall distribution of the given set of data points
    • Eg: Figure shows a simple 2-D data set containing 502 objects, with two obvious clusters.
    • Cluster C1 contains 400 objects while Cluster C2 contains 100 objects.
    • Two additional objects, o1 and o2 are clearly outliers.

enter image description here

  • However, by distance-based outlier detection only o1 is a reasonable outlier
    • This brings us to the notion of local outliers.
    • An object is a local outlier if it is outlyingrelative to its local neighborhood, particularly with respect to the density of the neighborhood.
    • In this view, o2 is a local outlier relative to the density of C2.
    • Object o1 is an outlier as well, and no objects in C1 are mislabeled as outliers.
    • This forms the basis of density-based local outlier detection.
    • Therefore unlike previous methods, it does not consider being an outlier as a binary property. Instead, it assesses the degree to which an object is an outlier.
    • This degree of “outlierness” is computed as the local outlier factor (LOF) of an object.
    • This degree of “outlierness” depends on how isolated the object is with respect to the surrounding neighborhood.
    • This approach can detect both global and local outliers.
  • The deviation-based approach:
    • This approach identifies outliers by examining the main characteristics of objects in a group.
    • Objects that “deviate” from this description are considered outliers.
    • Hence, in this approach the term deviation is typically used to refer to outliers.
    • There are two techniques for deviation-based outlier detection.
    • The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.
Please log in to add an answer.