An Introduction to Support Vector Machines (SVM): Basics

Support Vector Machine (SVM) Overview: Linear Support Vector Machines

Posted by Gu on May 23, 2019

Just to clarify, these contents are mainly summarized from the course I took: “Fundamental of Big Data Analytics”, taught by Prof. Mathar Rudolf. For more information please visit: https://www.ti.rwth-aachen.de

What is SVM?

Support Vector Machine (SVM) is a method for classification (and also for regression). Here we mainly discuss its most common application: the binary classification problem.

Given a training dataset with binary labels {+1, -1}, SVM aims to find a separating hyperplane that maximizes the margin.

Here the margin is the minimum distance from the points of both classes to the hyperplane. The support vectors are the points closest to the hyperplane; since these points alone determine the position and orientation of the hyperplane, they are called “support vectors”.

[Figure: what is SVM]
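
To make this concrete, here is a minimal sketch (my own addition, not from the course) that fits a linear SVM on a toy 2-D dataset with scikit-learn and prints the fitted hyperplane and its support vectors. The toy data and the large value of C, which approximates the hard-margin setting discussed below, are assumptions for illustration.

```python
# A minimal sketch: fit a linear SVM on toy 2-D data and inspect the support vectors.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data; labels are in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.5, 2.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM discussed in this post.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0])               # normal vector of the hyperplane
print("b =", clf.intercept_[0])          # offset
print("support vectors:\n", clf.support_vectors_)
```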

Linear SVM

In this series on SVMs, the following aspects will be discussed:

  1. The optimization problem of SVM in linearly separable case
  2. Using gradient descent algorithm to solve the SVM optimization problem
  3. Lagrangian dual optimization and Karush-Kuhn-Tucker (KKT) condition
  4. Dual problem of SVM and analysis
  5. Solve the dual problem using Sequential Minimal Optimization (SMO)
  6. The optimization problem of SVM with penalty term
  7. Kernel SVM for nonlinear cases

SVM in Linearly Separable Case

The linearly separable case means that the points of the two classes in the training dataset can be separated by a hyperplane $H:\{\mathbf{x}|\mathbf{w}^{T}\mathbf{x}+b=0 \}$. The goal of SVM is to find the optimal parameters $\hat{\mathbf{w}}$ and $\hat{b}$ that maximize the margin. See figure Linear-SVM.

We are given a training dataset $\{(\mathbf{x}_ i,\ y_i)\},\ i=1,\dots,n$, where $\mathbf{x}_ i\in\mathbb{R}^p$ and $y_i\in\{-1,+1\}$. If such a separating hyperplane exists, as shown in figure Linear-SVM, then there exists a positive value $\gamma>0$ such that

  • for any $y_i=+1$, $\mathbf{w}^T\mathbf{x}_i+b\geq \gamma$
  • for any $y_i=-1$, $\mathbf{w}^T\mathbf{x}_i+b\leq -\gamma$

These two conditions can be combined into a single one by considering $y_i$ and $\mathbf{x}_ {i}$ jointly:

  • $y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq \gamma \ \ \ $ for any $i=1,\dots,n$

The equality holds for the marginal points (the blue square and the red circles in figure Linear-SVM). This condition ensures that the hyperplane correctly classifies the two classes in the training dataset; SVM additionally requires the margin to be maximized.
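
To make the condition concrete, here is a small numerical check (my own illustration; the candidate hyperplane, offset, and data are made-up values) that evaluates $y_i(\mathbf{w}^T\mathbf{x}_ i+b)$ for every point:

```python
# Verify y_i (w^T x_i + b) >= gamma for every training point.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

w = np.array([1.0, 1.0])   # candidate normal vector (assumed)
b = -1.0                   # candidate offset (assumed)
gamma = 1.0                # functional margin gamma

margins = y * (X @ w + b)          # y_i (w^T x_i + b) for each i
print(margins)                     # [3.  5.  3.  4.5]
print(np.all(margins >= gamma))    # True: the hyperplane separates with margin gamma
```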

How is the margin defined?
The margin is the Euclidean distance from the closest points (the marginal points) to the hyperplane. Since a marginal point $(\mathbf{x}, y)$ satisfies $\mathbf{w}^T\mathbf{x}+b = \gamma$ or $\mathbf{w}^T\mathbf{x}+b = -\gamma$, and the hyperplane satisfies $\mathbf{w}^T\mathbf{x}+b = 0$, computing the margin is equivalent to computing the distance between the two parallel hyperplanes $h_1:\{\mathbf{x}|\mathbf{w}^T\mathbf{x}+b = \gamma\}$ and $h_2:\{\mathbf{x}|\mathbf{w}^T\mathbf{x}+b = 0\}$. Therefore, the margin is \(\frac{\gamma}{\|\mathbf{w}\|}\), where \(\|\mathbf{w}\|\) denotes the norm of the vector \(\mathbf{w}\).

Recall of linear algebra:

  • the unit normal vector of the hyperplane $h_1:\{\mathbf{x}|\mathbf{w}^T\mathbf{x}+b_1=0\}$ is \(\frac{\mathbf{w}}{\|\mathbf{w}\|}\)
  • the distance from origin $\mathbf{0}$ to hyperplane \(h_1:\{\mathbf{x}\mid\mathbf{w}^T\mathbf{x}+b_1=0\}\) is \(\frac{\vert b_1\vert}{\| \mathbf{w} \|}\).
    This can be proved by simply solving the equation \(\mathbf{w}^T(\alpha \frac{\mathbf{w}}{\|\mathbf{w}\|})+b_1=0\), where \(\vert\alpha\vert\) is the distance
  • the distance between two parallel hyperplanes \(h_1:\{\mathbf{x}\vert\mathbf{w}^T\mathbf{x}+b_1=0\}\) and \(h_2:\{\mathbf{x}\vert\mathbf{w}^T\mathbf{x}+b_2=0\}\) is \(\frac{\vert b_1-b_2\vert}{\|\mathbf{w}\|}\)
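
These formulas are easy to verify numerically; here is a quick sketch with arbitrary numbers (my own illustration):

```python
# Numerically check the distance formulas recalled above.
import numpy as np

w = np.array([3.0, 4.0])        # ||w|| = 5
b1, b2 = 10.0, -5.0

# Distance from the origin to {x | w^T x + b1 = 0}: |b1| / ||w||
print(abs(b1) / np.linalg.norm(w))          # 2.0

# Distance between the parallel hyperplanes with offsets b1 and b2: |b1 - b2| / ||w||
print(abs(b1 - b2) / np.linalg.norm(w))     # 3.0

# Cross-check: take the closest point of h1 to the origin and measure
# its distance to the second hyperplane.
x0 = -b1 * w / np.linalg.norm(w) ** 2       # closest point on h1 to the origin
print(abs(w @ x0 + b2) / np.linalg.norm(w)) # 3.0 again
```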

Therefore, the SVM optimization problem is:

$$\max_{\mathbf{w},b} \frac{\gamma}{\|\mathbf{w}\|},\ \ \text{s.t.}\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq\gamma\ ,\ \ i=1,\dots,n$$

Note that we cannot simply maximize \(\gamma\), because the definition of a hyperplane is scale invariant: \(\{\mathbf{x}\vert\mathbf{w}^T\mathbf{x}+b=\gamma\}\) and \(\{\mathbf{x}\vert\alpha\mathbf{w}^T\mathbf{x}+\alpha b=\alpha\gamma\}\) represent the same hyperplane for any \(\alpha>0\). Maximizing \(\gamma\) alone could therefore drive the magnitudes of \(\mathbf{w}\) and \(b\) towards infinity without contributing anything to the real margin.
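
The scale invariance is easy to check numerically: scaling \(\mathbf{w}\) and \(b\) by any \(\alpha>0\) leaves the distance of every point to the hyperplane unchanged. A small sketch with made-up numbers:

```python
# Scaling (w, b) by alpha > 0 does not change point-to-hyperplane distances.
import numpy as np

X = np.array([[2.0, 2.0], [-1.0, -1.0], [0.5, -2.0]])
w, b = np.array([1.0, 1.0]), -1.0
alpha = 10.0

dist = np.abs(X @ w + b) / np.linalg.norm(w)                            # original
dist_scaled = np.abs(X @ (alpha * w) + alpha * b) / np.linalg.norm(alpha * w)

print(np.allclose(dist, dist_scaled))   # True: same geometry, different scale
```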

It is the geometric margin \(\frac{\gamma}{\|\mathbf{w}\|}\) that matters, not the functional margin \(\gamma\). Therefore, we can fix \(\gamma\) at an arbitrary positive value; this only influences the scale of the final optimal \(\hat{\mathbf{w}}\) and \(\hat{b}\), while the optimal separating hyperplane remains the same. For simplicity, we set \(\gamma=1\). Then the SVM optimization problem is:

$$\max_{\mathbf{w},b} \frac{1}{\|\mathbf{w}\|},\ \ \text{s.t.}\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1\ ,\ \ i=1,\dots,n$$

We can convert this into an equivalent convex optimization problem by rewriting the objective: maximizing \(\frac{1}{\|\mathbf{w}\|}\) is the same as minimizing \(\|\mathbf{w}\|\), and squaring it (with a factor of \(\frac{1}{2}\) for convenience) yields a differentiable quadratic objective. The final form of the SVM optimization problem is:

$$\min_{\mathbf{w},b} \frac{1}{2}{\|\mathbf{w}\|^2},\ \ \text{s.t.}\ \ y_i(\mathbf{w}^T\mathbf{x}_ i+b)\geq 1\ ,\ \ i=1,\dots,n$$

This is the standard form of the linear SVM. In the next post I will show how to solve this optimization problem and obtain the optimal separating hyperplane \(\{\mathbf{x}\vert\hat{\mathbf{w}}^T\mathbf{x}+\hat{b}=0\}\).
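
In the meantime, the standard problem above can already be handed to a generic convex solver as a sanity check. Below is a minimal sketch using cvxpy (my own choice, not part of the course) on a made-up toy dataset:

```python
# Solve min 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 with a generic solver.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]

cp.Problem(objective, constraints).solve()
print("w_hat =", w.value, "b_hat =", b.value)
```

A generic solver is only a convenience for small examples; the dedicated methods covered later in this series, such as SMO on the dual problem, are the standard way to solve SVMs at scale.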