In data mining, the data must be pre-processed first: quality data yield quality analysis and information, which in turn support quality decisions.
Real-world data are generally incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Preparing such data for mining through the following processes is known as data preprocessing:
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: combining data from multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume but producing the same or similar analytical results.
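Two of the steps listed above, filling in missing values (cleaning) and normalization (transformation), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fill_missing` and `min_max_normalize`, the mean-imputation rule, and the sample `ages` list are all hypothetical choices, not something the text prescribes.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values (assumed rule)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute with one missing value.
ages = [20, None, 40, 60]
cleaned = fill_missing(ages)
normalized = min_max_normalize(cleaned)
```

Mean imputation is only one possible choice here; a median or a domain-specific default may suit skewed or categorical attributes better.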
The matching process eliminates duplicates by searching and matching records against parsed, corrected, and standardized data using standard business rules. An example is the identification of similar names and addresses.
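As a rough sketch of matching on similar names and addresses, the following uses `difflib.SequenceMatcher` from the Python standard library. The 0.85 threshold, the `find_matches` helper, and the sample records are assumptions made for illustration. Real matching rules would come from the business rules mentioned above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold=0.85):
    """Return index pairs of records whose name and address are both similar."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            name_sim = similarity(records[i]["name"], records[j]["name"])
            addr_sim = similarity(records[i]["address"], records[j]["address"])
            if name_sim >= threshold and addr_sim >= threshold:
                matches.append((i, j))
    return matches


# Hypothetical records; the first two are likely the same person.
records = [
    {"name": "John Smith", "address": "12 Main Street"},
    {"name": "Jon Smith", "address": "12 Main Street"},
    {"name": "Mary Jones", "address": "5 Oak Avenue"},
]
duplicate_pairs = find_matches(records)
```

Comparing every pair is quadratic in the number of records, so this approach is only practical for small sets.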
Consolidation merges matched records into a single representation by analysing and identifying the relationships between them.
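One possible way to consolidate a group of matched records is sketched below. The survivorship rule used here (keep the longest non-empty value per field) and the `consolidate` helper are assumptions chosen for brevity; an actual system would apply whatever rules the business defines.

```python
def consolidate(matched_records):
    """Merge matched records into one, keeping the longest non-empty value per field."""
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            value = (value or "").strip()
            # Assumed rule: a longer value is treated as more complete.
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged


# Hypothetical group of records already identified as the same customer.
group = [
    {"name": "J. Smith", "address": "12 Main Street", "phone": ""},
    {"name": "John Smith", "address": "12 Main St", "phone": "555-0100"},
]
golden = consolidate(group)
```

Note that a field that is empty in every record is simply absent from the merged result.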
3. Data cleansing must deal with many types of possible errors
Data from a single source can contain errors such as missing or incorrect values. When more than one source is involved, inconsistent and conflicting data are also possible.
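The difference between a missing value and a conflict between sources can be made concrete with a small check. In this sketch the two sources (`crm` and `billing`), the `find_issues` helper, and the field names are hypothetical. It assumes records in both sources share a common key.

```python
def find_issues(source_a, source_b, fields):
    """Compare records by key and report missing and conflicting field values."""
    empty = (None, "")
    issues = []
    for key in sorted(set(source_a) | set(source_b)):
        rec_a = source_a.get(key, {})
        rec_b = source_b.get(key, {})
        for field in fields:
            a, b = rec_a.get(field), rec_b.get(field)
            if a in empty and b in empty:
                # Neither source supplies a value.
                issues.append((key, field, "missing"))
            elif a not in empty and b not in empty and a != b:
                # Both sources supply a value, but they disagree.
                issues.append((key, field, "conflict"))
    return issues


# Hypothetical sources keyed by customer ID.
crm = {
    1: {"email": "a@example.com", "city": "Pune"},
    2: {"email": "", "city": "Delhi"},
}
billing = {
    1: {"email": "a@example.com", "city": "Mumbai"},
    2: {"email": None, "city": "Delhi"},
}
issues = find_issues(crm, billing, ["email", "city"])
```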
Data staging is an interim step between data extraction and the remaining steps. Data is accumulated from asynchronous sources through mechanisms such as native interfaces, flat files, and FTP sessions. At a predefined interval, the staged data is transformed and then loaded into the warehouse. End users have no access to the staging file. An operational data store may be used for data staging.
In the standardizing process, conversion routines transform data into a consistent format using both standard and custom business rules. Examples include adding a prename, replacing a nickname, and using a preferred street name.
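The three examples of standardization rules can be illustrated with simple lookup tables. The tables `NICKNAMES` and `STREET_FORMS` and both helper functions are hypothetical; they stand in for the standard and custom business rules an organization would actually maintain.

```python
# Hypothetical rule tables; a real system would load these from maintained rules.
NICKNAMES = {"bob": "Robert", "bill": "William"}
STREET_FORMS = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def standardize_name(name, prename=None):
    """Replace a leading nickname and optionally add a prename (title)."""
    words = name.split()
    if words:
        words[0] = NICKNAMES.get(words[0].lower(), words[0])
    if prename:
        words.insert(0, prename)
    return " ".join(words)


def standardize_address(address):
    """Expand street abbreviations to the preferred street name form."""
    words = [STREET_FORMS.get(w.lower().rstrip("."), w) for w in address.split()]
    return " ".join(words)


standard_name = standardize_name("Bob Smith", prename="Mr.")
standard_address = standardize_address("12 Main St.")
```

Standardizing before matching makes duplicate detection easier, because records that differ only in format become textually identical.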