What is data preprocessing? Explain the different methods for the data cleansing phase.

Subject: Data Mining And Business Intelligence

Topic: Data Preprocessing

Difficulty: Medium


In the data mining process, data must be preprocessed first so that it is of good enough quality to support reliable analysis and, in turn, sound decisions.

Real-world data are generally incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Preparing such data for mining through the following processes is known as data preprocessing:

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

• Data integration: combine data from multiple databases, data cubes, or files.

• Data transformation: normalization and aggregation.

• Data reduction: reduce the volume of data while producing the same or similar analytical results.
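The cleaning and transformation steps listed above can be illustrated with a minimal Python sketch. The function names, the choice of mean imputation, equal-depth bin means for smoothing, and min-max scaling are assumptions chosen for illustration; the answer itself does not prescribe any particular technique.

```python
def fill_missing(values):
    """Replace None entries with the mean of the observed values (assumed rule)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate the fill value from
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, depth):
    """Sort the values, split them into bins of `depth` items, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical
    return [(v - lo) / (hi - lo) for v in values]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up sample data
print(fill_missing([4, None, 8]))
print(smooth_by_bin_means(prices, depth=3))
print(min_max_normalize([10, 20, 30]))
```

Smoothing by bin means reduces noise at the cost of detail, so the bin depth is a trade-off that depends on the data.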



The matching process eliminates duplicates by searching for and matching records against parsed, corrected, and standardized data using standard business rules. For example, identifying similar names and addresses.
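As a rough illustration of matching similar names and addresses, the sketch below normalizes each record into a comparison key and flags pairs whose keys are nearly identical. The record layout, the helper names, and the 0.9 similarity threshold are hypothetical; a real business rule would be tuned to the data.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def match_key(record):
    """Build one comparison string from the name and address fields."""
    return normalize(f"{record['name']} {record['address']}")


def find_matches(records, threshold=0.9):
    """Return index pairs (i, j) whose keys are at least `threshold` similar."""
    keys = [match_key(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold
    ]


# Made-up sample records: the first three describe the same person.
customers = [
    {"name": "John Smith", "address": "12 Main St."},
    {"name": "john  smith", "address": "12 main st"},
    {"name": "Jon Smith", "address": "12 Main St"},
    {"name": "Mary Jones", "address": "4 Oak Ave"},
]
print(find_matches(customers))
```

Comparing every pair is quadratic in the number of records, so this approach only suits small inputs; larger datasets usually need some form of blocking before pairwise comparison.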


Consolidation merges the matched records into one representation by analysing and identifying the relationships between them.
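A minimal sketch of consolidation, assuming the matched records are dictionaries and that the rule for choosing a surviving value is "first non-empty value wins". Both the rule and the field names are illustrative assumptions, not something the answer specifies.

```python
def consolidate(matched_records):
    """Merge records judged to be the same entity into one record.

    For each field, keep the first non-empty value seen (assumed rule).
    """
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            # Fill the field only if it is still missing or empty.
            if merged.get(field) in (None, ""):
                merged[field] = value
    return merged


# Made-up duplicates of one customer, each holding part of the information.
duplicates = [
    {"name": "John Smith", "phone": None, "email": "js@example.com"},
    {"name": "John Smith", "phone": "555-0100", "email": ""},
]
print(consolidate(duplicates))
```

Other survivorship rules, such as preferring the most recently updated source, would fit the same structure.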

3. Data cleansing must deal with many types of possible errors

Data from a single source can contain errors such as missing or incorrect values. When more than one source is involved, inconsistent and conflicting data also become possible.
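To make the two kinds of error concrete, the sketch below compares records for the same keys from two sources and reports missing values and conflicts. The source names, field names, and the dictionary layout are hypothetical.

```python
def find_discrepancies(source_a, source_b):
    """Compare records that share a key in both sources.

    Returns (key, field, kind) tuples, where kind is "missing" when either
    side has no value and "conflict" when the two values differ.
    """
    issues = []
    for key in sorted(source_a.keys() & source_b.keys()):
        rec_a, rec_b = source_a[key], source_b[key]
        for field in sorted(rec_a.keys() | rec_b.keys()):
            a, b = rec_a.get(field), rec_b.get(field)
            if a is None or b is None:
                issues.append((key, field, "missing"))
            elif a != b:
                issues.append((key, field, "conflict"))
    return issues


# Made-up data from two systems, keyed by customer id.
crm = {
    101: {"city": "Pune", "phone": None},
    102: {"city": "Mumbai", "phone": "555-0101"},
}
billing = {
    101: {"city": "Pune", "phone": "555-0100"},
    102: {"city": "Bombay", "phone": "555-0101"},
}
print(find_discrepancies(crm, billing))
# -> [(101, 'phone', 'missing'), (102, 'city', 'conflict')]
```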

4. Data staging

Data staging is an interim step between data extraction and the remaining steps. Data is accumulated from asynchronous sources through mechanisms such as native interfaces, flat files, and FTP sessions. After transformation, the data is loaded into the warehouse at a predefined interval. End users have no access to the staging file. An operational data store may be used for data staging.


In the standardizing process, conversion routines transform data into a consistent format using both standard and custom business rules. For example, adding a prename, replacing a nickname, or using a preferred street name.
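The three examples above (prename, nickname, preferred street name) could look roughly like this in Python. The lookup tables and function names are made up for illustration; real conversion routines would rely on much larger reference data and custom business rules.

```python
# Hypothetical lookup tables standing in for business rules.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}
STREET_FORMS = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def standardize_name(name, prename=""):
    """Replace known nicknames and optionally prepend a prename such as 'Mr.'."""
    words = [NICKNAMES.get(w.lower(), w) for w in name.split()]
    return f"{prename} {' '.join(words)}".strip()


def standardize_address(address):
    """Expand street abbreviations to the preferred form and capitalize words."""
    words = [
        STREET_FORMS.get(w.lower().rstrip("."), w.capitalize())
        for w in address.split()
    ]
    return " ".join(words)


print(standardize_name("bob Smith", prename="Mr."))  # -> Mr. Robert Smith
print(standardize_address("12 main st."))            # -> 12 Main Street
```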
