Outliers are usually attributed to rare chance and may not always be fully explainable, but if you do not detect and handle them, they can distort predictions and reduce a model's accuracy.
The contentious decision to keep or discard an outlier needs to be made when building the model. Outliers can drastically bias the fit estimates and predictions. It is left to the analyst's best judgement to decide whether treating outliers is necessary and how to go about it.
Treating or altering outliers or extreme values in genuine observations is not a standard operating procedure. If a data point (or points) is excluded from the analysis, this should be clearly stated in any subsequent report.
An outlier is a rare occurrence within a given data set. In data science, an outlier is an observation that lies far from the other observations. It may be due to variability in the measurement, or it may indicate experimental error.
Simple representation of an outlier
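One common way to make "distant from other observations" concrete is the interquartile range (IQR) rule. The sketch below is only an illustration under stated assumptions: the function name iqr_outliers, the multiplier k = 1.5 and the sample readings are not from the original text, and other distance rules are equally valid.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up sample data: seven similar readings and one distant one.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(readings))
```

With this sample, only the reading 95 falls outside the fences. Whether it should be kept or discarded remains the analyst's decision.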
We propose a five-step outlier analysis procedure covering data sets, data cleaning, outlier detection, representation, profiling, handling and evaluation.
Each step is explained in detail as follows.
a) Data sets: Data sets are the starting point for outlier analysis. They come in different types, such as nominal, ordinal, interval, ratio, binary, continuous, discrete, transaction, spatial, spatio-temporal, sequence and time series data.
b) Data cleaning: Identifying missing values is one part of the data cleaning process, because missing values create difficulties for data analysis. Missing values can be handled by ignoring the record, filling in the value manually, filling it with a global constant, filling it with the attribute mean, or filling it with the attribute mean of all samples belonging to the same class as the given tuple.
c) Outlier detection techniques: Numerous outlier detection methods have been proposed in the last decade. The main focus here is on unsupervised methods. Some techniques are general-purpose, while others are designed for specific purposes.
d) Outlier detection approaches can be classified into three categories: supervised, semi-supervised and unsupervised.
e) Techniques trained in supervised mode assume the availability of a training data set that has labeled instances for both the normal and the anomaly class.
f) The typical approach in such cases is to build a predictive model for the normal vs. anomaly classes.
g) The unsupervised approach to outlier detection does not require training data. It takes a set of unlabeled data as input and attempts to find the outliers within it.
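The last option listed in step b), filling a missing value with the attribute mean of the samples in the same class, can be sketched roughly as follows. The record layout (a list of dicts), the key names "cls" and "x" and the function name fill_with_class_mean are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

def fill_with_class_mean(records, attr, class_key):
    """Return copies of records with missing attr values (None) replaced
    by the mean of attr over records of the same class."""
    by_class = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            by_class[r[class_key]].append(r[attr])

    # Fallback for a class with no observed values: the global attribute mean.
    observed = [v for vals in by_class.values() for v in vals]
    global_mean = mean(observed) if observed else None

    filled = []
    for r in records:
        r = dict(r)  # copy, so the input records are left untouched
        if r[attr] is None:
            vals = by_class.get(r[class_key])
            r[attr] = mean(vals) if vals else global_mean
        filled.append(r)
    return filled

# Made-up sample records with two missing values.
records = [
    {"cls": "A", "x": 1.0},
    {"cls": "A", "x": 3.0},
    {"cls": "A", "x": None},
    {"cls": "B", "x": 10.0},
    {"cls": "B", "x": None},
]
for row in fill_with_class_mean(records, "x", "cls"):
    print(row)
```

Filling with a mean is itself sensitive to outliers, so it may be worth detecting extreme values before imputing, or using a more robust statistic such as the median.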
Five Step Procedure of Outlier Analysis
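To contrast the supervised and unsupervised modes described in steps d) to g), here is a minimal sketch on one-dimensional data. The nearest-centroid classifier and the median absolute deviation (MAD) rule are stand-ins picked for brevity, not methods named in the original text. All function names, the threshold of 3.5 and the sample values are assumptions.

```python
from statistics import mean, median

def train_centroids(labeled):
    """Supervised mode: learn one centroid per class from (value, label) pairs."""
    groups = {}
    for value, label in labeled:
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}

def predict(centroids, value):
    """Assign the label whose centroid is closest to value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

def mad_outliers(values, threshold=3.5):
    """Unsupervised mode: flag values whose modified z-score, based on the
    median and the median absolute deviation, exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Supervised: labeled training data for both classes is required.
training = [(10, "normal"), (11, "normal"), (12, "normal"), (13, "normal"),
            (90, "anomaly"), (100, "anomaly")]
centroids = train_centroids(training)
print(predict(centroids, 12), predict(centroids, 80))

# Unsupervised: only the unlabeled observations are given.
print(mad_outliers([10, 12, 11, 13, 12, 11, 10, 95]))
```

The supervised path can only recognise anomalies that resemble the labeled examples it was trained on. The unsupervised path needs no labels, but it relies on an assumption about what "normal" spread looks like.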