**1 Answer**

written 2.6 years ago

DATA REDUCTION STRATEGIES

Why data reduction?

A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction

Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies

• Data cube aggregation

• Attribute subset selection - irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed

• Dimensionality reduction - e.g., remove unimportant attributes

• Data compression

• Numerosity reduction - e.g., fit data into models

• Discretization and concept hierarchy generation

### DATA CUBE AGGREGATION

On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales
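A minimal Python sketch of this roll-up, assuming the quarterly sales are stored as (year, quarter, amount) rows. The figures and the `aggregate_by_year` helper are hypothetical and only illustrate the aggregation:

```python
from collections import defaultdict

# Hypothetical quarterly sales rows: (year, quarter, amount in dollars).
quarterly_sales = [
    (2021, "Q1", 200), (2021, "Q2", 400), (2021, "Q3", 350), (2021, "Q4", 550),
    (2022, "Q1", 300), (2022, "Q2", 450), (2022, "Q3", 400), (2022, "Q4", 600),
]

def aggregate_by_year(rows):
    """Roll quarterly rows up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

Eight rows shrink to two, and any analysis that only needs annual totals gives the same answer on the smaller table.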

### ATTRIBUTE SUBSET SELECTION

Why attribute subset selection?

Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.

For example, if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer's telephone number are likely to be irrelevant, unlike attributes such as age.

• A domain expert can pick out some of the useful attributes

• Sometimes this can be a difficult and time-consuming task, especially when the behaviour of the data is not well known

• Leaving out relevant attributes or keeping irrelevant ones can result in discovered patterns of poor quality.

• In addition, the added volume of irrelevant or redundant attributes can slow down the mining process.

### ATTRIBUTE SUBSET SELECTION TECHNIQUES

• Step-wise forward selection

• Step-wise backward elimination

• Combining forward selection and backward elimination

• Decision-tree induction
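As a rough illustration of the first technique, step-wise forward selection can be sketched as a greedy loop: start with an empty set and repeatedly add the attribute that improves a quality score the most, stopping when nothing helps. The `score` callback, the `toy_score` function, and its `gain` weights are assumptions made up for this example; in practice the score might be a classifier's validation accuracy.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Score every candidate subset that adds one more attribute.
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining attribute improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected

# Hypothetical usefulness of each attribute, with a small cost per attribute.
gain = {"age": 0.5, "income": 0.3, "street": 0.05, "telephone_number": 0.0}

def toy_score(subset):
    return sum(gain[a] for a in subset) - 0.1 * len(subset)

chosen = forward_selection(gain.keys(), toy_score)
print(chosen)
```

Backward elimination runs the same idea in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least.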

### DIMENSIONALITY REDUCTION

Curse of dimensionality

When dimensionality increases, data becomes increasingly sparse

Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful

The possible combinations of subspaces will grow exponentially
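The loss of meaningful distances can be observed with a small experiment. The sketch below is only an illustration under stated assumptions: points are drawn uniformly from the unit hypercube, and `relative_contrast` is a hypothetical helper that compares the farthest and nearest neighbour of a random query point.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)
high_dim = relative_contrast(200)
print(low_dim, high_dim)
```

In two dimensions the farthest point is many times farther away than the nearest one; in 200 dimensions the two distances are nearly the same, so "near" and "far" stop being useful distinctions.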

Dimensionality reduction

Avoid the curse of dimensionality

Help eliminate irrelevant features and reduce noise

Reduce time and space required in data mining

Allow easier visualization

Dimensionality reduction techniques

• Wavelet transforms

• Principal Component Analysis

### DATA COMPRESSION

Data encoding or transformations are applied so as to obtain a reduced or compressed representation of original data

• Lossless - without any loss of information

• Lossy - approximation of original data
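The difference between the two can be sketched in a few lines of Python. This is an illustrative example only: it uses the standard-library `zlib` module for the lossless case and plain rounding as a stand-in for a lossy scheme, and the sample data is made up.

```python
import zlib

# Lossless: the original bytes are recovered exactly after decompression.
original = b"AAAABBBBCCCCDDDD" * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), len(compressed), restored == original)

# Lossy: rounding keeps only an approximation; the dropped digits are gone.
readings = [20.13, 20.17, 20.21, 19.98]
approximated = [round(r, 1) for r in readings]
print(approximated)
```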

### WAVELET TRANSFORMS

The Discrete Wavelet Transform (DWT) is a linear signal processing technique.

• When applied to a data vector D, it transforms D into a numerically different vector, D', of wavelet coefficients.

• All wavelet coefficients larger than some user-defined threshold can be retained. The remaining coefficients are set to 0.
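A minimal sketch of the idea, assuming the simplest wavelet (Haar) in its unnormalized averaging-and-differencing form and an input whose length is a power of two. The text does not say which wavelet is meant, so the function names, the sample vector, and the choice to compare absolute values against the threshold are all assumptions for illustration.

```python
def haar_transform(data):
    """Repeatedly replace pairs by their average and half-difference."""
    output = [float(x) for x in data]
    n = len(output)
    while n > 1:
        half = n // 2
        averages = [(output[2 * i] + output[2 * i + 1]) / 2 for i in range(half)]
        details = [(output[2 * i] - output[2 * i + 1]) / 2 for i in range(half)]
        output[:n] = averages + details
        n = half
    return output

def inverse_haar(coeffs):
    """Undo haar_transform: rebuild each pair from average and detail."""
    output = list(coeffs)
    n = 1
    while n < len(output):
        rebuilt = []
        for avg, det in zip(output[:n], output[n:2 * n]):
            rebuilt += [avg + det, avg - det]
        output[:2 * n] = rebuilt
        n *= 2
    return output

def keep_large(coeffs, threshold):
    """Zero every coefficient whose magnitude does not exceed the threshold."""
    return [c if abs(c) > threshold else 0.0 for c in coeffs]

D = [2, 2, 0, 2, 3, 5, 4, 4]          # illustrative data vector
D_prime = haar_transform(D)            # same length as D
sparse = keep_large(D_prime, 0.75)     # mostly zeros, cheap to store
approx = inverse_haar(sparse)          # approximation of D
print(D_prime, sparse, approx)
```

The thresholded vector holds only four non-zero values, yet the inverse transform still recovers a close approximation of the original eight.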

### PRINCIPAL COMPONENTS ANALYSIS (PCA)

Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis, or PCA, searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.

The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.

Unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA combines the essence of attributes by creating an alternative, smaller set of variables.
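To make the projection concrete, here is a rough pure-Python sketch for the case k = 1. It finds the first principal component by power iteration on the covariance matrix, which is one of several ways to compute it and is an assumption of this example rather than something the text prescribes. The sample points and function names are hypothetical, and the sketch assumes the data is not constant.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (means, unit vector) for the direction of greatest variance."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return means, vec

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]

points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, pc1 = first_principal_component(points)
reduced = project(points, means, pc1)   # one number per point instead of two
print(pc1, reduced)
```

The new variable mixes both original attributes, which is the contrast with attribute subset selection: nothing is dropped, but two columns are replaced by one.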

### NUMEROSITY REDUCTION TECHNIQUES

• Regression and Log-Linear Models

• Histograms

• Clustering

• Sampling

### REGRESSION AND LOG-LINEAR MODELS

In linear regression, the data are modeled to fit a straight line

For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation

y = α + βx

• x and y are numerical database attributes.

• The coefficients, α and β (called regression coefficients), specify the y-intercept and slope of the line, respectively.

• These coefficients can be solved for by the method of least squares.
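A short sketch of the least-squares fit for the single-predictor case, using the standard closed-form solution. The `fit_line` helper and the sample values are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative data that lies exactly on a line.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

For numerosity reduction, the point is that only the two coefficients need to be stored; the y values can then be estimated from x instead of being kept.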

• Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.

• Log-linear models are a technique used in statistics to examine the relationships among more than two categorical variables.

### HISTOGRAMS

Histograms use binning to approximate data distributions and are a popular form of data reduction.

A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets.

If each bucket represents only a single attribute-value/ frequency pair, the buckets are called singleton buckets.

The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar).

• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
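Both kinds of bucket can be built from this list in a few lines. The sketch below uses singleton buckets first and then, as an assumption for illustration, equal-width buckets of 10 dollars each; `equal_width_histogram` is a hypothetical helper name.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Singleton buckets: one (value, frequency) pair per distinct price.
singleton = Counter(prices)

def equal_width_histogram(values, width):
    """Count values per bucket [low, low + width - 1], with buckets starting at 1."""
    buckets = Counter()
    for v in values:
        low = ((v - 1) // width) * width + 1
        buckets[(low, low + width - 1)] += 1
    return dict(sorted(buckets.items()))

equal_width = equal_width_histogram(prices, 10)
print(singleton)
print(equal_width)
```

The 52 prices reduce to 13 singleton buckets, or to just three buckets when each one covers a 10-dollar range.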

### CLUSTERING

• Clustering techniques consider data tuples as objects.

• They partition the objects into groups or clusters, so that objects within a cluster are similar to one another and "dissimilar" to objects in other clusters.

• The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.

• Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid.
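Both quality measures follow directly from their definitions. The sketch below assumes Euclidean distance and objects given as coordinate tuples; the sample cluster is made up for illustration.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Maximum distance between any two objects in the cluster."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)),
               default=0.0)

def centroid_distance(cluster):
    """Average distance of each object from the cluster centroid."""
    dims = len(cluster[0])
    centroid = tuple(sum(p[i] for p in cluster) / len(cluster)
                     for i in range(dims))
    return sum(math.dist(p, centroid) for p in cluster) / len(cluster)

# Illustrative cluster: the four corners of a 3-by-4 rectangle.
cluster = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0), (0.0, 4.0)]
print(diameter(cluster), centroid_distance(cluster))
```

Smaller values of either measure indicate a tighter cluster, so the cluster can be represented more faithfully by a single summary such as its centroid.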