“The support vector machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labeled training data." A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors.

**Algorithm:**

- Define an optimal hyperplane: maximize margin.
- Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
- Map data to high dimensional space where it is easier to classify with linear decision surfaces: reformulate problem so that data is mapped implicitly to this space.

To define an optimal hyperplane we need to maximize the width of the margin (w).

The beauty of SVM is that if the data is linearly separable, there is a unique global minimum value. An ideal SVM analysis should produce a hyperplane that completely separates the vectors (cases) into two non-overlapping classes. However, perfect separation may not be possible, or it may result in a model with so many cases that the model does not classify correctly. In this situation SVM finds the hyperplane that maximizes the margin and minimizes the misclassifications.

**1. Maximum Margin Linear Separators**

For the maximum margin hyperplane only examples on the margin matter (only these affect the distances). These are called support vectors. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N — the number of features) that distinctly classifies the data points.

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.

**Hyperplanes**

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

**Support Vectors**

Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM. It will be useful computationally if only a small fraction of the datapoints are support vectors, because we use the support vectors to decide which side of the separator a test case is on.

The support vectors are indicated by the circles around them.

To find the maximum margin the separator, we have to solve following optimization problem:

$w.x^c+b\gt +1$ for positive cases

$w.x^c+b\lt -1$ for negative cases

and $||w||^2$ is as small as possible

2 Quadratic Programming Solution to Finding Maximum Margin Separators

**3. Kernels for Learning Non-Linear Functions**

Linear models are nice and interpretable but have limitations. Can’t learn difficult" nonlinear patterns.

Linear models rely on \linear" notions of similarity/distance

Sim$(x_n;x_m) = x\gt nx_m$

Dist $(x_n;x_m) =(x_n-x_m)^T (x_n-x_m)$

which wouldn’t work well if the patterns we want to learn are nonlinear.

**Kernels**

Kernels, using a feature mapping φ, map data to a new space where the original learning problem becomes \easy" (e.g., a linear model can be applied)