Data preprocessing is required because real-world data are generally:
Incomplete: Missing attribute values, missing certain attributes of importance, or having only aggregate data
Noisy: Containing errors or outliers
Inconsistent: Containing discrepancies in codes or names
Steps in Data preprocessing:
1. Data cleaning:
- Data cleaning is also called data cleansing or scrubbing.
- Fill in missing values, smooth noisy data, identify or remove the outliers, and resolve inconsistencies.
- Data cleaning is required because source systems contain “dirty data” that must be cleaned.
- Steps in Data cleaning:
1.1 Parsing:
- Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
- Examples include parsing the first, middle, and last name out of a single name field.
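A minimal sketch of the parsing step in Python; the single full-name source field and its first/middle/last layout are assumptions:

```python
def parse_full_name(full_name):
    """Split a single 'full name' field into first, middle, and last name."""
    parts = full_name.strip().split()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])  # whatever lies between first and last
    return {"first_name": first, "middle_name": middle, "last_name": last}

print(parse_full_name("John Fitzgerald Kennedy"))
# {'first_name': 'John', 'middle_name': 'Fitzgerald', 'last_name': 'Kennedy'}
```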
1.2 Correcting:
- Correct parsed individual data components using sophisticated data algorithms and secondary data sources.
- Examples include replacing a vanity address and adding a missing ZIP code.
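A minimal sketch of the correcting step; the ZIP_BY_CITY lookup table is a hypothetical stand-in for a real secondary data source such as a postal directory:

```python
# Hypothetical secondary data source: (city, area) -> ZIP code.
ZIP_BY_CITY = {("Mumbai", "Andheri"): "400053"}

def correct_address(record):
    """Fill in a missing ZIP code from the secondary data source."""
    if not record.get("zip"):
        key = (record.get("city"), record.get("area"))
        record["zip"] = ZIP_BY_CITY.get(key, "")  # leave blank if unknown
    return record

print(correct_address({"city": "Mumbai", "area": "Andheri", "zip": ""}))
# {'city': 'Mumbai', 'area': 'Andheri', 'zip': '400053'}
```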
1.3 Standardizing:
- Standardizing applies conversion routines to transform data into its preferred and consistent format using both standard and custom business rules.
- Examples include adding a prename (e.g., Mr./Ms.) and replacing a nickname with the formal name.
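A minimal sketch of the standardizing step; the nickname map and the prename rule are illustrative custom business rules, not a standard library:

```python
# Illustrative business rule: map common nicknames to formal names.
NICKNAMES = {"Bob": "Robert", "Bill": "William", "Liz": "Elizabeth"}

def standardize_name(record):
    """Replace a nickname with its formal form and add a prename."""
    record["first_name"] = NICKNAMES.get(record["first_name"], record["first_name"])
    if not record.get("prename"):
        record["prename"] = "Ms." if record.get("gender") == "F" else "Mr."
    return record

print(standardize_name({"first_name": "Bob", "gender": "M", "prename": ""}))
# {'first_name': 'Robert', 'gender': 'M', 'prename': 'Mr.'}
```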
1.4 Matching:
- Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
- Examples include identifying similar names and addresses.
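A minimal sketch of the matching step using Python's standard difflib; the 0.85 similarity threshold is an assumed business rule:

```python
from difflib import SequenceMatcher

def looks_like_duplicate(a, b, threshold=0.85):
    """Return True if two name/address strings are similar enough to match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(looks_like_duplicate("Robert Smith, 12 Main St",
                           "Robert Smyth, 12 Main Street"))  # True
print(looks_like_duplicate("Robert Smith", "Alice Jones"))    # False
```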
1.5 Consolidating:
- Analyzing and identifying relationships between matched records and consolidating/merging them into one representation.
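A minimal sketch of the consolidating step; the survivorship rule used here (keep the first non-empty value for each field) is an assumed business rule:

```python
def consolidate(matched_records):
    """Merge a group of matched records into one representation."""
    merged = {}
    for rec in matched_records:
        for field, value in rec.items():
            if value and not merged.get(field):  # keep first non-empty value
                merged[field] = value
    return merged

dupes = [{"name": "Robert Smith", "zip": ""},
         {"name": "Bob Smith", "zip": "400053"}]
print(consolidate(dupes))  # {'name': 'Robert Smith', 'zip': '400053'}
```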
1.6 Data cleansing must deal with many types of possible errors:
- These include missing data and incorrect data at one source.
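A minimal sketch of handling both error types for one numeric attribute; the attribute name, the valid range, and the mean-fill rule are all assumptions:

```python
def clean_ages(rows, valid=range(0, 121)):
    """Flag incorrect ages as missing, then fill missing ages with the mean."""
    # 1. Mark clearly incorrect values as missing.
    for r in rows:
        if r["age"] is not None and r["age"] not in valid:
            r["age"] = None
    # 2. Fill missing values with the mean of the valid ones.
    known = [r["age"] for r in rows if r["age"] is not None]
    mean_age = round(sum(known) / len(known))
    for r in rows:
        if r["age"] is None:
            r["age"] = mean_age
    return rows

print(clean_ages([{"age": 30}, {"age": None}, {"age": 250}, {"age": 40}]))
# [{'age': 30}, {'age': 35}, {'age': 35}, {'age': 40}]
```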
1.7 Data Staging:
- Accumulates data from asynchronous sources.
- At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse.
- There is usually no end user access to the staging file.
- An operational data store may be used for data staging.
2. Data integration and Transformation:
Data integration: Combines data from multiple sources into a coherent data store, e.g., a data warehouse.
Sources may include multiple databases, data cubes or data files.
Issues in data integration:
Schema integration:
- Integrate metadata from different sources.
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id = B.cust#.
Detecting and resolving data value conflicts:
- For the same real world entity, attribute values from different sources are different.
- Possible reasons: different representations, different scales.
- Redundant data often occur when integrating multiple databases:
- The same attribute may have different names in different databases.
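A minimal sketch of integrating two such sources with pandas; the column names, the A.cust_id to B.cust# mapping, and the sample data are illustrative:

```python
import pandas as pd

# Source A and source B describe the same customers under different keys
# (entity identification: A.cust_id corresponds to B."cust#").
a = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
b = pd.DataFrame({"cust#": [1, 2], "total_sales": [250.0, 480.0]})

# Join on the matched identifiers, then drop the now-redundant key column.
integrated = a.merge(b, left_on="cust_id", right_on="cust#").drop(columns="cust#")
print(integrated)
#    cust_id  name  total_sales
# 0        1  Asha        250.0
# 1        2  Ravi        480.0
```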
- Data Transformation: The transformation process deals with rectifying any inconsistencies in the data.
- One of the most common transformation issues is ‘Attribute Naming Inconsistency’: a given data element is often referred to by different data names in different databases.
- E.g., Employee Name may be EMP_NAME in one database and ENAME in another.
- Thus one set of data names is picked and used consistently in the data warehouse.
- Once all the data elements have the right names, they must be converted to common formats.
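A minimal sketch of resolving attribute naming inconsistency with pandas; the source column names and the chosen warehouse names are illustrative:

```python
import pandas as pd

# One source calls the attribute EMP_NAME, another calls it ENAME;
# the warehouse picks one data name and uses it consistently.
src1 = pd.DataFrame({"EMP_NAME": ["Asha"], "SAL": [52000]})
src2 = pd.DataFrame({"ENAME": ["Ravi"], "SALARY": [61000]})

RENAME = {"EMP_NAME": "employee_name", "ENAME": "employee_name",
          "SAL": "salary", "SALARY": "salary"}

combined = pd.concat([src1.rename(columns=RENAME),
                      src2.rename(columns=RENAME)], ignore_index=True)

# Once the names agree, convert the attribute to a common format.
combined["salary"] = combined["salary"].astype(float)
print(combined)
```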
3. Data Reduction:
- Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies include:
- Reducing the number of attributes
- Reducing the number of attribute values
- Reducing the number of tuples
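A minimal sketch of all three strategies with pandas on an illustrative table; the constant column and the 10% sample rate are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"age": range(1000),
                   "income": range(1000),
                   "country": ["IN"] * 1000})   # constant: carries no information

df = df.drop(columns=["country"])               # fewer attributes
df["income"] = (df["income"] // 100) * 100      # fewer attribute values (coarser granularity)
sample = df.sample(frac=0.1, random_state=0)    # fewer tuples
print(sample.shape)                             # (100, 2)
```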
4. Discretization and Concept Hierarchy Generation (or summarization):
- Discretization: Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
- Interval labels can then be used to replace actual data values.
- Concept Hierarchies: Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
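A minimal sketch of both ideas with pandas; the interval boundaries and the concept labels (young, middle-aged, senior) are assumptions:

```python
import pandas as pd

ages = pd.Series([15, 23, 37, 45, 62, 71])

# Discretization: divide the continuous range into intervals.
intervals = pd.cut(ages, bins=[0, 30, 60, 120])

# Concept hierarchy: replace the numeric values with higher-level concepts.
concepts = pd.cut(ages, bins=[0, 30, 60, 120],
                  labels=["young", "middle-aged", "senior"])
print(list(concepts))
# ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```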