Explain data cleaning, data integration and data transformation in detail
1 Answer

Data Integration

  • Combines data from multiple sources into a coherent store

  • Schema integration: match equivalent attributes across sources, e.g., A.cust-id = B.cust-#

  • Integrate metadata from different sources

■ Entity identification problem:

  • Identify the same real-world entity across multiple data sources, e.g., Bill Clinton = William Clinton

■ Detecting and resolving data value conflicts

  • For the same real-world entity, attribute values from different sources may differ

  • Possible reasons: different representations or different scales, e.g., metric vs. British units, or GPA scales in the US vs. China
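To make the points above concrete, here is a minimal sketch in plain Python. The two sources, their field names (`cust-id`, `cust-#`, `height_cm`, `height_in`) and all values are made up for illustration. It assumes the customer key is enough to identify an entity; matching on names alone (Bill vs. William) would need a separate matching step that is not shown here.

```python
# Hypothetical records from two sources; field names and values are invented.
source_a = [
    {"cust-id": 101, "name": "Bill Clinton", "height_cm": 188.0},
    {"cust-id": 102, "name": "Ann Lee", "height_cm": 165.0},
]
source_b = [
    {"cust-#": 101, "name": "William Clinton", "height_in": 74.0},
    {"cust-#": 103, "name": "Raj Patel", "height_in": 70.0},
]

INCH_TO_CM = 2.54


def normalize_a(rec):
    """Map source A onto the shared schema (schema integration)."""
    return {"cust_id": rec["cust-id"], "name": rec["name"],
            "height_cm": rec["height_cm"]}


def normalize_b(rec):
    """Map source B onto the shared schema and convert inches to centimetres."""
    return {"cust_id": rec["cust-#"], "name": rec["name"],
            "height_cm": round(rec["height_in"] * INCH_TO_CM, 1)}


def integrate(a_records, b_records, tolerance_cm=1.0):
    """Merge both sources by key; report value conflicts beyond the tolerance."""
    merged = {}
    conflicts = []
    for rec in map(normalize_a, a_records):
        merged[rec["cust_id"]] = rec
    for rec in map(normalize_b, b_records):
        existing = merged.get(rec["cust_id"])
        if existing is None:
            merged[rec["cust_id"]] = rec
        elif abs(existing["height_cm"] - rec["height_cm"]) > tolerance_cm:
            # Same entity, different values even after unit conversion.
            conflicts.append((rec["cust_id"], existing["height_cm"],
                              rec["height_cm"]))
    return merged, conflicts


merged, conflicts = integrate(source_a, source_b)
print(sorted(merged))  # keys of the integrated store
print(conflicts)       # value conflicts left for manual resolution
```

In this sketch source A wins when both sources describe the same customer; which source to trust is a design choice that depends on the data.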

Handling Redundancy in Data Integration

■ Redundant data often occur when multiple databases are integrated

• Object identification: the same attribute or object may have different names in different databases

• Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue

• Redundant attributes can often be detected by correlation analysis

• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
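As a rough illustration of the "derivable data" case, the sketch below checks whether an annual revenue attribute can be recomputed from monthly figures. The rows and the field names `monthly_revenue` and `annual_revenue` are hypothetical.

```python
import math

# Hypothetical rows; "annual_revenue" may be derivable from the monthly figures.
rows = [
    {"store": "S1", "monthly_revenue": [10.0] * 12, "annual_revenue": 120.0},
    {"store": "S2", "monthly_revenue": [5.5] * 12, "annual_revenue": 66.0},
]


def is_derivable(rows, rel_tol=1e-9):
    """Return True if annual_revenue equals the sum of monthly_revenue in every row."""
    return all(
        math.isclose(sum(r["monthly_revenue"]), r["annual_revenue"], rel_tol=rel_tol)
        for r in rows
    )


annual_is_redundant = is_derivable(rows)
print(annual_is_redundant)
```

If the check holds for every row, the derived attribute carries no extra information and could be dropped from the integrated store.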

Correlation Analysis (Numerical Data)

• Correlation coefficient (also called Pearson's product moment coefficient)

$$ r_{A, B}=\frac{\sum(A-\bar{A})(B-\bar{B})}{(n-1) \sigma_{A} \sigma_{B}}=\frac{\sum(A B)-n \bar{A} \bar{B}}{(n-1) \sigma_{A} \sigma_{B}} $$

where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of $A$ and $B$, $\sigma_{A}$ and $\sigma_{B}$ are the respective standard deviations of $A$ and $B$, and $\sum(A B)$ is the sum of the $A B$ cross-products.

• If $r_{A, B}>0$, $A$ and $B$ are positively correlated ($A$'s values increase as $B$'s do). The higher the value, the stronger the correlation.

• $r_{A, B}=0$: no linear correlation (this alone does not guarantee independence); $r_{A, B}<0$: negatively correlated
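The coefficient can be computed directly from its definition. This is a minimal sketch using only the standard library; it assumes the sample standard deviation (dividing by $n-1$), which is what makes the $(n-1)$ factor in the denominator work out. The sample lists are invented.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)


a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]   # rises with a: r close to +1
c = [5, 4, 3, 2, 1]    # falls as a rises: r close to -1

print(pearson_r(a, b))
print(pearson_r(a, c))
```

An attribute pair with $|r_{A,B}|$ close to 1 is a candidate for redundancy, so one of the two may be removable after integration.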

Data Transformation

■ Smoothing: remove noise from data

■ Aggregation: summarization

■ Generalization: concept hierarchy climbing

■ Normalization: scaled to fall within a small, specified range

  • min-max normalization

  • z-score normalization

  • normalization by decimal scaling

■ Attribute/feature construction

  • New attributes constructed from the given ones
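The three normalization methods listed above can be sketched as follows. This uses the usual textbook definitions: min-max maps $v$ to $\frac{v-\min}{\max-\min}(new\_max-new\_min)+new\_min$, z-score maps $v$ to $(v-\bar{A})/\sigma_{A}$, and decimal scaling divides by $10^{j}$ for the smallest integer $j$ that brings every absolute value below 1. The function names and sample values are my own, and the z-score here assumes the population standard deviation.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for a constant attribute")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("z-score normalization is undefined for a constant attribute")
    return [(v - mean) / sd for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [12000, 73600, 98000]   # hypothetical attribute values
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling([-986, 917]))
```

Min-max keeps the original relationships between values but is sensitive to outliers; z-score is often preferred when the true minimum and maximum are unknown.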
