Write a short note on Data transformation and data discretization with respect to preprocessing
1 Answer

Data Transformation and Discretization

  • Data transformation in data mining is done by combining unstructured data with structured data so that it can be analyzed later. It is also important when data is migrated to a new cloud data warehouse.
  • When the data is homogeneous and well-structured, it is easier to analyze and look for patterns.
  • For example, a company has acquired another firm and now has to consolidate all the business data. The smaller company may be using a different database than the parent firm, and the data in these databases may have different IDs, keys, and values. All of this needs to be formatted so that the records are consistent and can be evaluated together.
  • This is why data transformation methods are applied. They are described below:

Data Smoothing

  • This method is used for removing noise from a dataset. Noise refers to distorted and meaningless data within a dataset.

  • Smoothing uses algorithms to highlight the special features in the data.

  • After the noise is removed, the process can detect even small changes in the data and so reveal special patterns.

  • Any data modification or trend can be identified by this method.
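
A minimal sketch in Python of smoothing by bin means, one common smoothing technique (the data values and bin size here are invented for illustration):

```python
# Smoothing by bin means: sort the values, split them into equal-size
# bins, and replace every value in a bin by the bin's mean.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # illustrative values
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([bin_mean] * len(bin_vals))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```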

Data Aggregation

  • Aggregation is the process of collecting data from a variety of sources and storing it in a single format. Here, data is collected, stored, analyzed, and presented in a report or summary format.

  • It helps in gathering more information about a particular data cluster. The method helps in collecting vast amounts of data.

  • This is a crucial step, as the accuracy and quantity of data are important for proper analysis.

  • Companies collect data about their website visitors. This gives them an idea about customer demographics and behavior metrics. This aggregated data assists them in designing personalized messages, offers, and discounts.
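
A small sketch of aggregation using pandas; the column names and figures are assumptions made up for this example, not real visitor data:

```python
import pandas as pd

# Hypothetical per-visit records, as if collected from several sources.
visits = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "visitor_age": [23, 35, 41, 29, 52],
    "pages_viewed": [3, 7, 2, 5, 4],
})

# Aggregate: summarize many raw records into one row per region.
summary = visits.groupby("region").agg(
    visitors=("visitor_age", "count"),
    avg_age=("visitor_age", "mean"),
    total_pages=("pages_viewed", "sum"),
)
print(summary)
```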

Discretization

  • This is a process of converting continuous data into a set of data intervals. Continuous attribute values are substituted by small interval labels. This makes the data easier to study and analyze.
  • If a data mining task handles a continuous attribute, its continuous values can be replaced by these discrete interval labels. This improves the efficiency of the task.
  • This method is also called a data reduction mechanism as it transforms a large dataset into a set of categorical data.
  • Discretization can be done by Binning, Histogram Analysis, and Correlation Analysis.
  • Discretization is also used with decision tree-based algorithms, which produce short, compact, and accurate results when using discrete values.
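
As a hedged sketch, binning with pandas can replace continuous values by interval labels (the ages and cut-off points are invented for illustration):

```python
import pandas as pd

# Discretization by binning: continuous ages -> interval labels.
ages = pd.Series([22, 25, 31, 38, 45, 47, 52, 60, 66, 70])

equal_width = pd.cut(ages, bins=3)  # three equal-width intervals
labelled = pd.cut(ages, bins=[0, 30, 60, 100],
                  labels=["young", "middle-aged", "senior"])

print(equal_width.tolist())
print(labelled.tolist())
```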

Generalization

  • In this process, low-level data attributes are transformed into high-level data attributes using concept hierarchies. This conversion from a lower level to a higher conceptual level is useful to get a clearer picture of the data.
  • For example, age data in a dataset may take values such as (20, 30). At a higher conceptual level, these are transformed into the categorical values (young, old), as in the sketch after this list.
  • Data generalization can be divided into two approaches: the data cube process (OLAP) and the attribute-oriented induction (AOI) approach.
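
A tiny sketch of generalization via a concept hierarchy, reusing the age example above; the cut-off of 30 is an assumption for illustration:

```python
# Climb the concept hierarchy: raw ages (low level) -> labels (high level).
def generalize_age(age: int) -> str:
    return "young" if age < 30 else "old"  # assumed cut-off

print([generalize_age(a) for a in [20, 30, 27, 55]])
# ['young', 'old', 'young', 'old']
```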

Attribute construction

  • In the attribute construction method, new attributes are created from an existing set of attributes.

  • For example, in a dataset of employee information, the attributes can be employee name, employee ID, and address.

  • These attributes can be used to construct another dataset containing information about only those employees who joined in the year 2019.

  • This method of construction makes mining more efficient and helps in creating new datasets quickly.
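
A short sketch of attribute construction; the field names and records below are hypothetical:

```python
# Construct a new attribute (join_year) from an existing one (join_date),
# then use it to build the 2019-only subset described above.
employees = [
    {"name": "Asha",  "employee_id": 101, "join_date": "2019-03-14"},
    {"name": "Ravi",  "employee_id": 102, "join_date": "2018-11-02"},
    {"name": "Meera", "employee_id": 103, "join_date": "2019-07-21"},
]

for e in employees:
    e["join_year"] = int(e["join_date"][:4])  # newly constructed attribute

joined_2019 = [e for e in employees if e["join_year"] == 2019]
print([e["name"] for e in joined_2019])  # ['Asha', 'Meera']
```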

Normalization

  • This is one of the crucial data transformation techniques applied during pre-processing in data mining.
  • Here, the data is transformed so that it falls within a given range. When attributes are on different ranges or scales, data modeling and mining can be difficult.
  • Normalization helps in applying data mining algorithms and extracting data faster.
  • The popular normalization methods are:

    Min-max normalization

  • In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the attribute are found, and each value is replaced according to the following formula:

    v^{\prime}=\frac{v-\min_{A}}{\max_{A}-\min_{A}}\left(\text{new\_max}_{A}-\text{new\_min}_{A}\right)+\text{new\_min}_{A}

  • Where A is the attribute, min_A and max_A are the minimum and maximum values of A respectively, v is the old value of an entry in the data, v' is its new value, and new_min_A and new_max_A are the minimum and maximum of the required range (i.e., its boundary values) respectively.

  • Example: Suppose the income range from $10,000 to $95,000 is normalized to [0.0, 1.0]. By min-max normalization, a value of $64,300 for income is transformed to ((64300 - 10000) / (95000 - 10000)) (1.0 - 0.0) + 0.0 = 0.6388.
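
The worked example can be checked with a few lines of Python (a minimal sketch, not a library routine):

```python
# Min-max normalization, as in the formula above.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(64300, 10000, 95000), 4))  # 0.6388
```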

    Z-score normalization

  • In this technique, values are normalized based on the mean and standard deviation of attribute A.
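
  • Each value v of A is replaced by:

    v^{\prime}=\frac{v-\bar{A}}{\sigma_{A}}

  • where \bar{A} and \sigma_{A} are the mean and standard deviation of A respectively.

A minimal sketch using Python's statistics module (the sample values are invented):

```python
from statistics import mean, stdev

# Z-score normalization: centre on the mean, scale by the (sample)
# standard deviation.
def z_score(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

print(z_score([10, 20, 30, 40, 50]))
# roughly [-1.26, -0.63, 0.0, 0.63, 1.26]
```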

    Decimal scaling

  • It normalizes by moving the decimal point of the values of the data. To normalize the data by this technique, each value v_i of attribute A is divided by a power of ten, giving the normalized value v'_i by the formula below:

    v^{\prime}_{i}=\frac{v_{i}}{10^{j}}

  • where j is the smallest integer such that max(|v'_i|) < 1.
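
A small sketch of decimal scaling (the sample values are invented):

```python
# Decimal scaling: divide every value by 10**j, where j is the smallest
# integer that brings all absolute values below 1.
def decimal_scale(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scale([345, -78, 921]))  # [0.345, -0.078, 0.921]
```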
