0
Expectation-Maximization (EM) Algorithm & Radial Basis Functions

1. Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is a way to find maximum-likelihood estimates for model parameters when our data is incomplete, has missing data points, or has unobserved (hidden) latent variables. It is an iterative way to approximate the maximum likelihood function. While maximum likelihood estimation can find the “best fit” model for a set of data, it doesn’t work particularly well for incomplete data sets. The more complex EM algorithm can find model parameters even if you have missing data. It works by choosing random values for the missing data points, and using those guesses to estimate a second set of data. The new values are used to create a better guess for the first set, and the process continues until the algorithm converges on a fixed point.

The Expectation Maximization (EM) algorithm can be used to generate the best hypothesis for the distributional parameters of some multi-modal data. Note that we say ‘the best’ hypothesis. But what is ‘the best’? The best hypothesis for the distributional parameters is the maximum likelihood hypothesis – the one that maximizes the probability that this data we are looking at comes from K distributions, each with a mean $m_k$ and variance sigma$_k^2$. Here, we are assuming that we are dealing with K normal distributions.

In a single modal normal distribution this hypothesis h is estimated directly from the data as:

estimated $m =m~ = sum(x_i)/N$.........(1)

estimated $sigma2= sigma2~= sum(xi- m~)^2/N$........(2)

Which are simply the trusted arithmetic average and variance. In a multi-modal distribution we need to estimate $h = [ m_1,m_2,...,m_K;$ sigma$_1^2$,sigma$_2^2$,...,sigma$_K^2$ ]. The EM algorithm is going to help us to do this. Let’s see how.

We begin with some initial estimate for each $m_k~$ and sigma$_k^2~$. We will have a total of K estimates for each parameter. The estimates can be taken from the plots which we can make, our domain knowledge, or they even can be wild (but not too wild) guesses. We then proceed to take each data point and answer the following question – what is the probability that this data point was generated from a normal distribution with mean $m_k~$ and sigma$_k^2~?$ That is, we repeat this question for each set of our distributional parameters. Thus we need to answer these questions twice – what is the probability that a data point $x_i, i=1,...N$, was drawn from $N(m_1~,$ sigma$_1^2~)$ and what is the probability that it was drawn from $N(m_2~,$ sigma$_2^2~)$. By the normal density function we get:

P($x_i$ belongs to $N(m_1~ ,$ sigma$_1^2~)$)=1/sqrt(2pi sigma$_1^2~$) * exp(-($x_i- m_1~$)^2/(2*sigma$_1^2~$))........ (3)

P($x_i$ belongs to $N(m_2~ ,$ sigma$_2^2~)$)=1/sqrt(2pi sigma$_2^2~$) * exp(-($x_i- m_2~$)^2/(2*sigma$_2^2~$)) ....... (4)

The individual probabilities only tell us half of the story because we still need to take into account the probability of picking N($m_1$~, sigma$_1^2$~) or N($m_2$~, sigma$_2^2$~) to draw the data from. We now arrive at what is known as responsibilities of each distribution for each data point. In a classification task this responsibility can be expressed as the probability that a data point $x_i$ belongs to some class $c_k$:

P($x_i$ belongs to $c_k$) = omega$_k$~ * P($x_i$ belongs to N($m_1$~ , sigma$_1^2$~)) / sum(omega$_k$~ * P($x_i$ belongs to N($m_1$~ , sigma$_1^2$~))).........(5)

In Equation 5 we introduce a new parameter omega$_k$~ which is the probability of picking k’s distribution to draw the data point from. Figure 1 indicates that each of our two clusters are equally likely to be picked. But like with m$_k$~ and sigma$k_2$~ we do not really know the value for this parameter. Therefore, we need to guess it and it is a part of our hypothesis:

h = [ $m_1, m_2, ..., m_K$; sigma$_1^2$, sigma$_2^2$, ..., sigma$_K^2$; omega$_1$~, omega$_2$~, ..., omega$_K$~ ]..........(6)

The denominator in Equation 5 is the sum of probabilities of observing $x_i$ in each cluster weighted by that cluster’s probability. Essentially, it is the total probability of observing $x_i$ in our data.

If we are making hard cluster assignments, we will take the maximum P($x_i$ belongs to $c_k$) and assign the data point to that cluster. We repeat this probabilistic assignment for each data point. In the end this will give us the first data ‘re-shuffle’ into K clusters. We are now in a position to update the initial estimates for h to h'. These two steps of estimating the distributional parameters and updating them after probabilistic data assignments to clusters is repeated until convergences to h*.

In summary, the two steps of the EM algorithm are:

• E-step: perform probabilistic assignments of each data point to some class based on the current hypothesis h for the distributional class parameters;
• M-step: update the hypothesis h for the distributional class parameters based on the new data assignments.

During the E-step we are calculating the expected value of cluster assignments. During the M-step we are calculating a new maximum likelihood for our hypothesis.

Applications

The EM algorithm has many applications, including:

• Dis-entangling superimposed signals,
• Estimating Gaussian mixture models (GMMs),
• Estimating hidden Markov models (HMMs),
• Estimating parameters for compound Dirichlet distributions,
• Finding optimal mixtures of fixed models.

Limitations

The EM algorithm can be very slow, even on the fastest computer. It works best when you only have a small percentage of missing data and the dimensionality of the data isn’t too big. The higher the dimensionality, the slower the E-step; for data with larger dimensionality, we may find the E-step runs extremely slow as the procedure approaches a local maximum.

Radial basis functions are means to approximate multivariable (also called multivariate) functions by linear combinations of terms based on a single univariate function (the radial basis function). This is radialised so that in can be used in more than one dimension. They are usually applied to approximate functions or data which are only known at a finite number of points (or too difficult to evaluate otherwise), so that then evaluations of the approximating function can take place often and efficiently.

Primarily in computational applications, functions of many variables often need to be approximated by other functions that are better understood or more readily evaluated. This may be for the purpose of displaying them frequently on a computer screen for instance, so computer graphics are a field of practical use. Radial basis functions are one efficient, frequently used way to do this. Further applications include the important fields of neural networks and learning theory. Since they are radially symmetric functions which are shifted by points in multidimensional Euclidean space and then linearly combined, they form data-dependent approximation spaces. This data-dependence makes the spaces so formed suitable for providing approximations to large classes of given functions. It also opens the door to existence and uniqueness results for interpolating scattered data by radial basis functions in very general settings (in particular in many dimensions).

Indeed, one of the greatest advantages of this method lies in its applicability in almost any dimension (whence its versatility) because there are generally little restrictions on the way the data are prescribed.

RBF Network Architecture

An RBFN performs classification by measuring the input’s similarity to examples from the training set. Each RBFN neuron stores a “prototype”, which is just one of the examples from the training set. When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype. Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.

The above illustration shows the typical architecture of an RBF Network. It consists of an input vector, a layer of RBF neurons, and an output layer with one node per category or class of data.

The Input Vector

The input vector is the n-dimensional vector that you are trying to classify. The entire input vector is shown to each of the RBF neurons.

The RBF Neurons

Each RBF neuron stores a “prototype” vector which is just one of the vectors from the training set. Each RBF neuron compares the input vector to its prototype, and outputs a value between 0 and 1 which is a measure of similarity. If the input is equal to the prototype, then the output of that RBF neuron will be 1. As the distance between the input and prototype grows, the response falls off exponentially towards 0. The shape of the RBF neuron’s response is a bell curve, as illustrated in the network architecture diagram.

The neuron’s response value is also called its “activation” value.

The prototype vector is also often called the neuron’s “center”, since it’s the value at the center of the bell curve.

The Output Nodes

The output of the network consists of a set of nodes, one per category that we are trying to classify. Each output node computes a sort of score for the associated category. Typically, a classification decision is made by assigning the input to the category with the highest score.

The score is computed by taking a weighted sum of the activation values from every RBF neuron. By weighted sum we mean that an output node associates a weight value with each of the RBF neurons, and multiplies the neuron’s activation by this weight before adding it to the total response.

Because each output node is computing the score for a different category, every output node has its own set of weights. The output node will typically give a positive weight to the RBF neurons that belong to its category, and a negative weight to the others.

BF Neuron Activation Function

Each RBF neuron computes a measure of the similarity between the input and its prototype vector (taken from the training set). Input vectors which are more similar to the prototype return a result closer to 1. There are different possible choices of similarity functions, but the most popular is based on the Gaussian. Below is the equation for a Gaussian with a one-dimensional input.

$f(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-u)^2}{2\sigma ^2}}$

Where x is the input, mu is the mean, and sigma is the standard deviation. This produces the familiar bell curve shown below, which is centered at the mean, mu (in the below plot the mean is 5 and sigma is 1).

The RBF neuron activation function is slightly different, and is typically written as:

$\phi (x)=e^{-\beta ||x-\mu||^2}$

Radial basis function (RBF) networks have advantages of easy design, good generalization, strong tolerance to input noise, and online learning ability. The properties of RBF networks make it very suitable to design flexible control systems.