Discuss different steps involved in Data Preprocessing.

## Steps of data preprocessing

1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

2. Data integration: combine data from multiple databases, data cubes, or files.

3. Data transformation: normalization and aggregation.

4. Data reduction: reduce the volume of the data while producing the same or similar analytical results.

5. Data discretization: a part of data reduction that replaces numerical attributes with nominal ones.

## Data cleaning

1. Fill in missing values (attribute or class value):

⦁ Ignore the tuple: usually done when the class label is missing.

⦁ Use the attribute mean (or majority nominal value) to fill in the missing value.

⦁ Use the attribute mean (or majority nominal value) for all samples belonging to the same class.

⦁ Predict the missing value by using a learning algorithm: treat the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or a decision tree) to predict the missing value.

2. Identify outliers and smooth out noisy data:

⦁ Binning: sort the attribute values and partition them into bins (see "Unsupervised discretization" below), then smooth by bin means, bin medians, or bin boundaries.

⦁ Clustering: group values into clusters and then detect and remove outliers (automatically or manually).

⦁ Regression: smooth by fitting the data to regression functions.

3. Correct inconsistent data: use domain knowledge or expert decision.
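
As a rough illustration of two of the cleaning steps above, the sketch below fills missing values with the attribute mean of the same class and smooths sorted values by bin means. The function names, the dictionary-based row layout, and the sample numbers are assumptions made for this example. It uses equal-depth bins, which is only one possible way to partition the values.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(rows, attr, label):
    """Return copies of rows where a missing (None) attr value is replaced
    by the mean of attr over the rows that share the same class label.

    Assumes every class has at least one observed value for attr.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            observed[row[label]].append(row[attr])
    class_mean = {cls: mean(vals) for cls, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attr] is None:
            new_row[attr] = class_mean[new_row[label]]
        filled.append(new_row)
    return filled


def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into bins of bin_size values each
    (equal-depth binning), and replace every value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical sample data, for illustration only.
rows = [
    {"income": 30, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 50, "cls": "low"},
    {"income": 90, "cls": "high"},
    {"income": None, "cls": "high"},
]
filled = fill_with_class_mean(rows, "income", "cls")
print([row["income"] for row in filled])  # [30, 40, 50, 90, 90]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Smoothing by bin medians or bin boundaries would follow the same pattern, with only the replacement value for each bin changed.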

## Data transformation

1. Normalization:

⦁ Scaling attribute values to fall within a specified range. Example: to transform V in [Min, Max] to V' in [0, 1], apply V' = (V - Min) / (Max - Min).

⦁ Scaling by using the mean and standard deviation (useful when Min and Max are unknown or when there are outliers): V' = (V - Mean) / StDev.

2. Aggregation: moving up in the concept hierarchy on numeric attributes.

3. Generalization: moving up in the concept hierarchy on nominal attributes.

4. Attribute construction: replacing or adding new attributes inferred from existing attributes.
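
The two normalization formulas above can be sketched in a few lines of Python. This is a minimal example, not a definitive implementation: the function names and sample values are assumptions, and the standard deviation is taken as the population standard deviation (the text does not say whether the population or sample form is intended).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Apply V' = (V - Min) / (Max - Min) so results fall in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so Max - Min is zero")
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Apply V' = (V - Mean) / StDev using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]


# Hypothetical sample values, for illustration only.
scaled = min_max_scale([10, 20, 30, 50])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

standardized = z_score_scale([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, st. dev. 2
print(standardized)
```

After z-score scaling the values have mean 0 and standard deviation 1, but they are not confined to a fixed range the way min-max scaled values are.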

## Data reduction

1. Reducing the number of attributes

⦁ Data cube aggregation: applying roll-up, slice, or dice operations.

⦁ Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space.

⦁ Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data.

2. Reducing the number of attribute values

⦁ Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).

⦁ Clustering: grouping values into clusters.

⦁ Aggregation or generalization

3. Reducing the number of tuples

⦁ Sampling
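
Two of the reduction techniques above, binning attribute values into intervals and sampling tuples, can be sketched as follows. The function names, the choice of equal-width bins, simple random sampling without replacement, and the sample data are all assumptions made for this illustration. Other binning and sampling schemes fit the same description.

```python
import random


def equal_width_bins(values, n_bins):
    """Map each numeric value to the index (0 .. n_bins - 1) of the
    equal-width interval it falls into between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land at index n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def simple_random_sample(tuples, k, seed=None):
    """Pick k tuples without replacement; seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(tuples, k)


# Hypothetical sample data, for illustration only.
ages = [5, 18, 25, 40, 62, 80]
bins = equal_width_bins(ages, 3)  # interval width is (80 - 5) / 3 = 25
print(bins)  # [0, 0, 0, 1, 2, 2]

records = list(range(10))
sample = simple_random_sample(records, 3, seed=0)
print(sample)
```

Binning replaces six distinct ages with three interval labels, while sampling keeps every attribute but only a subset of the tuples.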