Why is Data Preprocessing required? Explain the different steps involved in Data Preprocessing
1 Answer

Data Preprocessing is required because:

  • Real world data are generally:

    Incomplete: Missing attribute values, missing certain attributes of importance, or having only aggregate data

    Noisy: Containing errors or outliers

    Inconsistent: Containing discrepancies in codes or names

  • Steps in Data preprocessing:

    1. Data cleaning:

    • Data cleaning, also called data cleansing or scrubbing.
    • Fill in missing values, smooth noisy data, identify or remove the outliers, and resolve inconsistencies.
    • Data cleaning is required because source systems contain “dirty data” that must be cleaned.
    • Steps in Data cleaning:

    1.1 Parsing:

    • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
    • Example includes parsing the first, middle and the last name.

    1.2 Correcting:

    • Correct parsed individual data components using sophisticated data algorithms and secondary data sources.
    • Example includes replacing a vanity address and adding a zip code.

    1.3 Standardizing:

    • Standardizing applies conversion routines to transform data into its preferred and consistent format using both standard and custom business rules.
    • Examples include adding a pre name, replacing a nickname.

    1.4 Matching:

    • Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
    • Examples include identifying similar names and addresses.

    1.5 Consolidating:

    • Analyzing and identifying relationships between matched records and consolidating/merging them into one representation.

    1.6 Data cleansing must deal with many types of possible errors:

    • These include missing data and incorrect data at one source.

    1.7 Data Staging:

    • Accumulates data from asynchronous sources.
    • At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse.
    • There is usually no end user access to the staging file.
    • An operational data store may be used for data staging.

enter image description here

2. Data integration and Transformation:

  • Data integration: Combines data from multiple sources into a coherent data store e.g. data warehouse.

  • Sources may include multiple databases, data cubes or data files.

    Issues in data integration:

    • Schema integration:

      • Integrate metadata from different sources.
      • Entity identification problem: identify real world entities from multiple data sources, e.g. A cust-id=B.cust#.
    • Detecting and resolving data value conflicts:

      • For the same real world entity, attribute values from different sources are different.
      • Possible reasons: different representations, different scales.
    • Redundant data occur often when integration of multiple databases:
      • The same attribute may have different names in different databases.

enter image description here

  • Data Transformation: Transformation process deals with rectifying any inconsistency (if any).
  • One of the most common transformation issues is ‘Attribute Naming Inconsistency’. It is common for the given data element to be referred to by different data names in different databases.
  • Eg Employee Name may be EMP_NAME in one database, ENAME in the other.
  • Thus one set of Data Names are picked and used consistently in the data warehouse.
  • Once all the data elements have right names, they must be converted to common formats.

enter image description here

3. Data Reduction:

  • Obtains reduced representation in volume but produces the same or similar analytical results.
  • Need for data reduction:

    • Reducing the number of attributes
    • Reducing the number of attribute values
    • Reducing the number of tuples

enter image description here

4. Discretization and Concept Hierarchy Generation(or summarization):

  • Discretization: Reduce the number of values for a given continuous attribute by divide the range of a continuous attribute into intervals.
  • Interval labels can then be used to replace actual data values.
  • Concept Hierarchies: Reduce the data by collecting and replacing low level concepts(such as numeric values for the attribute age)by higher level concepts(such as young, middle-aged or senior).
Please log in to add an answer.