**Numerosity Reduction**

Numerosity reduction replaces the original data with alternative, smaller forms of data representation in order to reduce the volume of data.

These techniques may be parametric or nonparametric.

Parametric:

For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.)

e.g., regression models and log-linear models; the latter estimate discrete multidimensional probability distributions.

Nonparametric:

Nonparametric methods store reduced representations of the data rather than a fitted model. They include histograms, clustering, and sampling.

Regression and Log-Linear Models

• Regression and log-linear models can be used to approximate the given data.

• In (simple) linear regression, the data are modeled to fit a straight line.

• Multiple linear regression is an extension of (simple) linear regression, which allows a response variable y to be modeled as a linear function of two or more predictor variables.

• Log-linear models approximate discrete multidimensional probability distributions.

• Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.

• This allows a higher-dimensional data space to be constructed from lower-dimensional spaces.

• Log-linear models are therefore also useful for dimensionality reduction and data smoothing.

• Regression and log-linear models can both be used on sparse data, although their application may be limited.

• Both methods can handle skewed data, and regression does so exceptionally well. Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions.
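As a minimal sketch of the parametric idea for simple linear regression: the data are replaced by the two fitted parameters of a line, plus any points that the line fits poorly. The function names `fit_line` and `reduce_with_line`, the tolerance `tol`, and the sample values are assumptions made for illustration; they do not come from the notes.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b. Returns (w, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w = sxy / sxx              # slope
    b = y_bar - w * x_bar      # intercept
    return w, b

def reduce_with_line(xs, ys, tol):
    """Keep only (w, b) and the points whose residual exceeds tol.

    Treating large-residual points as outliers is an illustrative
    assumption, not a rule stated in the notes.
    """
    w, b = fit_line(xs, ys)
    outliers = [(x, y) for x, y in zip(xs, ys) if abs(y - (w * x + b)) > tol]
    return w, b, outliers

# Hypothetical data: six (x, y) pairs lying close to a straight line.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]

w, b, outliers = reduce_with_line(xs, ys, tol=1.0)
print(f"stored parameters: w={w:.4f}, b={b:.4f}, outliers={outliers}")
```

Here twelve numbers are reduced to two parameters, since no point deviates from the fitted line by more than the tolerance. With noisier data the list of stored outliers grows, and the saving shrinks accordingly.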

Histograms

• Histograms use binning to approximate data distributions and are a popular form of data reduction.

• A histogram partitions the data distribution into disjoint subsets, or buckets.

• If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.

• Singleton buckets are useful for storing high-frequency outliers.

• Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data.

• The histograms for single attributes can be extended for multiple attributes.

• Multidimensional histograms can capture dependencies between attributes.
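The bucket ideas above can be sketched for a single attribute as follows. This is a minimal illustration that assumes equal-width partitioning, which is only one possible way to choose buckets. The function names and the sample values are hypothetical.

```python
from collections import Counter

def singleton_buckets(values):
    """One bucket per distinct value: a sorted list of (value, frequency)."""
    return sorted(Counter(values).items())

def equal_width_histogram(values, num_buckets):
    """Partition [min, max] into num_buckets disjoint ranges of equal width.

    Returns a list of ((low, high), count). Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum lies on the right edge, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [((lo + i * width, lo + (i + 1) * width), counts[i])
            for i in range(num_buckets)]

# Hypothetical attribute values (e.g. item prices).
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 15, 18, 20, 20, 21, 25, 25, 30]

singles = singleton_buckets(prices)
hist = equal_width_histogram(prices, 3)

print(len(singles), "singleton buckets")
for (low, high), count in hist:
    print(f"[{low:.2f}, {high:.2f}]: {count}")
```

The eighteen values collapse into three range/count pairs, at the cost of losing the exact values within each bucket. Singleton buckets keep every distinct value, so they reduce the data only when values repeat often.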