DATA REDUCTION STRATEGIES
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Attribute subset selection
Irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed
Dimensionality reduction, e.g., remove unimportant attributes
Data Compression
Numerosity reduction - e.g., fit data into models
Discretization and concept hierarchy generation
DATA CUBE AGGREGATION
On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales
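The roll-up from quarterly to annual sales can be sketched in a few lines of Python. The sales figures and the function name `aggregate_by_year` below are hypothetical and serve only to illustrate the aggregation step.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, sales amount).
quarterly_sales = [
    (2021, "Q1", 200), (2021, "Q2", 350), (2021, "Q3", 400), (2021, "Q4", 550),
    (2022, "Q1", 300), (2022, "Q2", 450), (2022, "Q3", 500), (2022, "Q4", 650),
]

def aggregate_by_year(records):
    """Sum the quarterly amounts so that one value per year remains."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)

annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

Eight quarterly records shrink to two annual totals, and any analysis that only needs yearly figures gives the same result on the smaller table.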
ATTRIBUTE SUBSET SELECTION
Why attribute subset selection?
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
For example, if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer's telephone number are likely to be irrelevant, unlike attributes such as age.
• A domain expert can be used to pick out some of the useful attributes.
• Sometimes this can be a difficult and time-consuming task, especially when the behaviour of the data is not well known.
• Leaving out relevant attributes or keeping irrelevant attributes can result in discovered patterns of poor quality.
• In addition, the added volume of irrelevant or redundant attributes can slow down the mining process.
ATTRIBUTE SUBSET SELECTION TECHNIQUES
• Step-wise forward selection
• Step-wise backward elimination
• Combining forward selection and backward elimination
• Decision-tree induction
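As an illustration of the first technique, step-wise forward selection can be sketched as a greedy loop that starts with no attributes and repeatedly adds the one that helps most. The sketch below assumes a numeric target, scores each candidate by the residual sum of squares of a least-squares fit, and stops after a fixed number of attributes; the scoring rule, the stopping rule, and the synthetic data are assumptions made for the example, not part of the notes above.

```python
import numpy as np

def residual_sum_of_squares(X, y, columns):
    """Fit least squares on the chosen columns (plus an intercept); return the RSS."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

def forward_selection(X, y, max_features):
    """Greedily add the attribute that lowers the RSS the most at each step."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        best = min(
            remaining,
            key=lambda c: residual_sum_of_squares(X, y, selected + [c]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumption): only attributes 1 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

chosen = forward_selection(X, y, max_features=2)
print(chosen)
```

Backward elimination follows the same pattern in reverse: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.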
DIMENSIONALITY REDUCTION
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
DATA COMPRESSION
Data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data
• Lossless: the original data can be reconstructed without any loss of information
• Lossy: only an approximation of the original data can be reconstructed
WAVELET TRANSFORMS
The Discrete Wavelet Transform (DWT) is a linear signal processing technique.
• When applied to a data vector D, it transforms D into a numerically different vector, D', of wavelet coefficients.
• All wavelet coefficients larger than some user-defined threshold can be retained. The remaining coefficients are set to 0.
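A minimal sketch of this idea, assuming the Haar wavelet (the simplest DWT) and an input whose length is a power of 2. The function names `haar_dwt`, `haar_idwt`, and `compress`, the sample vector, and the threshold value are all illustrative choices, not taken from the notes.

```python
import numpy as np

def haar_dwt(data):
    """Full orthonormal Haar decomposition of a vector of power-of-2 length."""
    coeffs = np.asarray(data, dtype=float).copy()
    n = len(coeffs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    length = n
    while length > 1:
        half = length // 2
        pairs = coeffs[:length].reshape(half, 2)
        smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)  # scaled pairwise sums
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # scaled pairwise differences
        coeffs[:half] = smooth
        coeffs[half:length] = detail
        length = half  # recurse on the smoothed half only
    return coeffs

def haar_idwt(coeffs):
    """Invert haar_dwt by undoing the levels from coarsest to finest."""
    data = np.asarray(coeffs, dtype=float).copy()
    n = len(data)
    length = 2
    while length <= n:
        half = length // 2
        smooth = data[:half].copy()
        detail = data[half:length].copy()
        data[0:length:2] = (smooth + detail) / np.sqrt(2)
        data[1:length:2] = (smooth - detail) / np.sqrt(2)
        length *= 2
    return data

def compress(data, threshold):
    """Keep coefficients whose magnitude exceeds the threshold; zero the rest."""
    coeffs = haar_dwt(data)
    coeffs[np.abs(coeffs) <= threshold] = 0.0
    return coeffs

signal = [2, 2, 0, 2, 3, 5, 4, 4]          # illustrative data vector D
coeffs = haar_dwt(signal)                  # D', same length as D
sparse = compress(signal, threshold=1.5)   # mostly zeros, cheap to store
approx = haar_idwt(sparse)                 # lossy reconstruction of D
```

The transformed vector has the same length as the input; the reduction comes from storing only the few nonzero coefficients of `sparse`, at the cost of an approximate reconstruction.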
PRINCIPAL COMPONENTS ANALYSIS (PCA)
Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis, or PCA, searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
Unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA combines the essence of attributes by creating an alternative, smaller set of variables.
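The projection can be sketched with NumPy as follows. This is one common way to compute PCA (eigen-decomposition of the covariance matrix of the centred data); the function name `pca_project` and the random input are assumptions made for the example.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k orthogonal directions of largest variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                  # centre each attribute
    covariance = np.cov(centered, rowvar=False)    # n x n covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:k]      # indices of the k largest eigenvalues
    components = eigenvectors[:, order]            # n x k, orthonormal columns
    return centered @ components, components

# Illustrative data: 100 tuples described by n = 3 attributes, reduced to k = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
reduced, components = pca_project(X, k=2)
print(reduced.shape)
```

Each column of `components` is a linear combination of all original attributes, which is what distinguishes PCA from simply keeping a subset of the attributes.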
NUMEROSITY REDUCTION TECHNIQUES
• Regression and Log-Linear Models
• Histograms
• Clustering
• Sampling
REGRESSION AND LOG-LINEAR MODELS
In linear regression, the data are modeled to fit a straight line
For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation
y = α + βx
• x and y are numerical database attributes.
• The coefficients α and β (called regression coefficients) specify the y-intercept and slope of the line, respectively.
• These coefficients can be solved for by the method of least squares.
• Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modelled as a linear function of two or more predictor variables.
• Log-linear modelling is a technique used in statistics to examine the relationship among more than two categorical variables.
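For the simple (one-predictor) case, the least-squares estimates of the intercept and slope have a closed form, sketched below. The function name `fit_line` and the sample points are illustrative. Once fitted, only the two coefficients need to be stored instead of the data, which is the sense in which regression reduces numerosity.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative points that lie exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```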
HISTOGRAMS
Histograms use binning to approximate data distributions and are a popular form of data reduction.
A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets.
If each bucket represents only a single attribute-value/ frequency pair, the buckets are called singleton buckets.
The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar).
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
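The price list above can be turned into histograms with a short sketch. It first builds singleton buckets (one per distinct price) and then, as one possible coarser partitioning, equal-width buckets of 10 dollars; the equal-width rule and the function name `equal_width_histogram` are assumptions chosen for illustration.

```python
from collections import Counter

# The sorted price list above, written as price -> number of occurrences.
value_counts = {1: 2, 5: 5, 8: 2, 10: 4, 12: 1, 14: 3, 15: 6,
                18: 8, 20: 7, 21: 4, 25: 5, 28: 2, 30: 3}
prices = [price for price, count in value_counts.items() for _ in range(count)]

# Singleton buckets: each bucket holds one value/frequency pair.
singleton_buckets = Counter(prices)

def equal_width_histogram(values, width):
    """Count values in buckets [1, width], [width + 1, 2 * width], and so on."""
    buckets = Counter()
    for value in values:
        low = ((value - 1) // width) * width + 1
        buckets[(low, low + width - 1)] += 1
    return dict(sorted(buckets.items()))

wide_buckets = equal_width_histogram(prices, width=10)
print(wide_buckets)
```

The 52 prices are summarised by 13 singleton buckets, or by just 3 buckets when each covers a 10-dollar range, trading detail for a smaller representation.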
CLUSTERING
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups or clusters, so that objects within a cluster are similar to one another and "dissimilar" to objects in other clusters.
• The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.
• Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
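Both quality measures are straightforward to compute for a given cluster. The sketch below assumes Euclidean distance and uses a small hypothetical cluster of 2-D points; the function names are illustrative.

```python
import math
from itertools import combinations

def diameter(points):
    """Maximum Euclidean distance between any two objects in the cluster."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)

def centroid_distance(points):
    """Average Euclidean distance of each object from the cluster centroid."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return sum(math.dist(p, centroid) for p in points) / len(points)

# Hypothetical cluster: the four corners of a 2 x 2 square.
cluster = [(0, 0), (0, 2), (2, 0), (2, 2)]
print(diameter(cluster), centroid_distance(cluster))
```

For data reduction, a cluster can then be represented by its centroid (and optionally one of these quality measures) in place of its individual objects.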