What is noisy data? How to handle noisy data


written 8.1 years ago by

• Noisy data is data that contains random error or variance in a measured variable; in a looser sense, it is meaningless data.

• It includes any data that cannot be understood and interpreted correctly by machines, such as unstructured text.

• Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis.

• Noisy data can be caused by faulty data collection instruments, human or computer errors at data entry, data transmission errors, limited buffer size for coordinating synchronized data transfer, inconsistencies in the naming conventions or data codes used, and inconsistent formats for input fields (e.g., dates).

Noisy data can be handled with the following techniques:

Binning:

• Binning methods smooth a sorted data value by consulting the values around it.

• The sorted values are distributed into a number of “buckets,” or bins.

• Because binning methods consult only the neighboring values, they perform local smoothing.

• In smoothing by bin means, each value in a bin is replaced by the mean of the bin. Similarly, in smoothing by bin medians, each bin value is replaced by the bin median.

• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries.

• Each bin value is then replaced by the closest boundary value.

• In general, the larger the bin width, the greater the effect of the smoothing.

• Bins may be equal-frequency, where each bin holds the same number of values, or equal-width, where the interval range of values in each bin is constant.

• Binning is also used as a discretization technique.
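The binning steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the sample `prices` list, and the choice of three values per bin are all assumptions made for the example. Ties in boundary smoothing are resolved toward the lower boundary, which is also an arbitrary choice.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def equal_width_bins(values, n_bins):
    """Sort the values and assign them to n_bins intervals of constant width."""
    ordered = sorted(values)
    low, high = ordered[0], ordered[-1]
    width = (high - low) / n_bins or 1  # fall back to 1 if all values are equal
    bins = [[] for _ in range(n_bins)]
    for v in ordered:
        index = min(int((v - low) / width), n_bins - 1)  # maximum goes in last bin
        bins[index].append(v)
    return [b for b in bins if b]  # drop empty intervals


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative data only (hypothetical prices).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print("bins:      ", bins)
print("means:     ", smooth_by_means(bins))
print("medians:   ", smooth_by_medians(bins))
print("boundaries:", smooth_by_boundaries(bins))
print("equal-width bins:", equal_width_bins(prices, 3))
```

Passing the result of `equal_width_bins` to the same smoothing functions shows how the partitioning strategy changes which values are treated as neighbors.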

Regression:

• Data can also be smoothed by fitting it to a function, a technique known as regression.

• Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.

• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
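As a rough sketch of smoothing by linear regression, the following Python code fits a least-squares line to two attributes and replaces each observed value with the value predicted by the line. The function names and the sample data are assumptions made for illustration, and the code assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # zero only if all xs are equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each y with the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Illustrative data only: roughly linear values with some noise.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.3, 11.9]
print(fit_line(xs, ys))
print(smooth_with_line(xs, ys))
```

The fitted line can also be used for prediction: given a new x, `slope * x + intercept` estimates the corresponding y.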

Clustering:

• Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”

• Intuitively, values that fall outside of the set of clusters may be considered outliers.
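To illustrate the idea, here is a deliberately simple one-dimensional sketch in Python. It is not a standard clustering algorithm such as k-means: it groups sorted values whenever neighboring values are within `max_gap` of each other, then flags values in groups smaller than `min_cluster_size` as outliers. Both parameter names, their values, and the sample data are assumptions chosen for the example.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted values; start a new cluster when the gap to the previous value exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)  # close enough to the previous value
        else:
            clusters.append([v])    # too far away: begin a new cluster
    return clusters


def find_outliers(values, max_gap, min_cluster_size):
    """Return values that belong to clusters smaller than min_cluster_size."""
    clusters = cluster_by_gap(values, max_gap)
    return [v for c in clusters if len(c) < min_cluster_size for v in c]


# Illustrative data only: two dense groups and one isolated value.
readings = [10, 11, 12, 13, 50, 51, 52, 53, 200]
print(cluster_by_gap(readings, max_gap=3))
print(find_outliers(readings, max_gap=3, min_cluster_size=2))
```

In practice the thresholds would need tuning to the data, and multi-attribute data would call for a proper clustering method with a distance measure.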
