An Introduction to Support Vector Machines (SVM): kernel functions

Support Vector Machine (SVM) Overview: Kernel Functions

Posted by Gu on June 27, 2019

Recall the dual problem of the soft-margin (slack) SVM:

Dual Problem

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Suppose that we have solved the dual problem and obtained the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) denote the support set associated with \(\mathbf{w}\), and \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) the support set associated with \(b\). We also define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:

$$ \mathbf{w}^\star = \sum_{i\in S_w}\lambda_i^\star y_i \mathbf{x}_i\\ b^\star= y_j - {\mathbf{w}^\star}^T\mathbf{x}_j = y_j - \sum_{i\in S_w}\lambda_i^\star y_i \mathbf{x}_i^T \mathbf{x}_j \ , \ j\in S_b\\ $$

Given a new point \(\mathbf{x}\), we classify it according to the sign of the decision value:

$$ \begin{align} \hat{y} &= {\mathbf{w}^\star}^T \mathbf{x} + b^\star\\ &=\sum_{i\in S_w} \lambda^\star_i y_i \mathbf{x}_i^T \mathbf{x} + b^\star\\ \end{align} $$
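These formulas translate directly into code. Below is a minimal NumPy sketch, assuming the dual optimum `lam_star` has already been obtained (e.g. with the gradient-based solver from the previous post); the function names, the tolerance `tol`, and the averaging of \(b^\star\) over \(S_b\) (rather than picking a single \(j\)) are my own choices:

```python
import numpy as np

def primal_from_dual(X, y, lam_star, C, tol=1e-8):
    """Recover w* and b* from the dual optimum of the soft-margin SVM.

    X: (n, p) data matrix, y: (n,) labels in {-1, +1},
    lam_star: (n,) dual optimum, C: box-constraint parameter.
    """
    S_w = lam_star > tol                             # 0 < lambda_i <= C
    S_b = (lam_star > tol) & (lam_star < C - tol)    # 0 < lambda_i < C
    w_star = (lam_star[S_w] * y[S_w]) @ X[S_w]       # sum_i lambda_i y_i x_i
    # The formula allows any j in S_b; averaging over S_b is more stable numerically.
    b_star = np.mean(y[S_b] - X[S_b] @ w_star)
    return w_star, b_star

def predict(X_new, w_star, b_star):
    # Decision value w*^T x + b*; the predicted label is its sign.
    return np.sign(X_new @ w_star + b_star)
```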

From the formulas above, we notice that in the dual problem, in the computation of \(\mathbf{w}^\star\), and in the classification of new points, the data points only ever appear through the inner products \(\mathbf{x}_ i^T\mathbf{x}_ j\).

SVM with kernel functions

Mapping points to a higher dimensional space

In some cases, points that are not linearly separable in the current space may become linearly separable once we map them into a higher-dimensional space.

Mapping points from 2d to 3d to make them linearly separable.
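To make the idea concrete, here is a small sketch (my own example, not the figure above) using the hypothetical mapping \(\phi(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\): two concentric circles are not linearly separable in \(R^2\), but their images in \(R^3\) are separated by a plane:

```python
import numpy as np

def phi(X):
    """Explicit feature map R^2 -> R^3: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, np.sqrt(2) * x1 * x2, x2**2], axis=1)

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=100)
inner = np.stack([np.cos(theta[:50]), np.sin(theta[:50])], axis=1)      # radius 1, label -1
outer = 3 * np.stack([np.cos(theta[50:]), np.sin(theta[50:])], axis=1)  # radius 3, label +1

Z_inner, Z_outer = phi(inner), phi(outer)
# In the mapped space, the plane z1 + z3 = 5 (i.e. x1^2 + x2^2 = 5) separates the two classes.
w, b = np.array([1.0, 0.0, 1.0]), -5.0
print(np.all(Z_inner @ w + b < 0), np.all(Z_outer @ w + b > 0))  # True True
```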

We define \(\phi(\mathbf{x}): R^p \rightarrow R^d\ ,\ d>p\) as a mapping function that maps a low-dimensional point to a higher-dimensional one. We can first map our data \(\mathbf{x}_ i \rightarrow \phi(\mathbf{x}_ i)\) and then solve the dual problem:

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Again, in the dual problem, in the computation of \(\mathbf{w}^\star\), and in classification, the mapped points only ever appear through the inner product \(\phi(\mathbf{x}_ i)^T\phi(\mathbf{x}_ j)\). Therefore, we never need the explicit form of \(\phi(\mathbf{x})\); instead, we can work directly with a function that computes the inner product of two mapped points, \(K: R^p \times R^p \rightarrow R\):

$$ K_{i,j}=K(\mathbf{x}_i, \mathbf{x}_j)=<\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)> $$

We call \(K(\mathbf{x}_i, \mathbf{x}_j)\) the kernel function.
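For example, with the hypothetical degree-2 mapping \(\phi(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\) used in the sketch above, the function \(K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2\) returns exactly the inner product of the mapped points, without ever constructing \(\phi\):

```python
import numpy as np

def phi(x):
    # Explicit map R^2 -> R^3, used here only to check the equivalence.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, y):
    # Degree-2 polynomial kernel (c = 0), evaluated directly in the original space.
    return (x @ y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y), phi(x) @ phi(y))  # both equal 1.0 (up to floating-point error)
```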

What is a valid kernel function?

A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if there exists a mapping function \(\phi\) such that \(K_{i,j} = <\phi(\mathbf{x}_ i), \phi(\mathbf{x}_ j)>\) holds for any \(\mathbf{x}_ i, \mathbf{x}_ j\in R^p\).

Moreover, there is an equivalent characterization of the validity of a kernel function.

A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if and only if, for any \(n\) samples \(\{ \mathbf{x}_ i \vert \mathbf{x}_ i \in R^p \}, i=1,\dots, n\), the kernel matrix \(\mathbf{K}=\begin{bmatrix}K_{1,1} & \dots & K_{1,n}\\ \vdots & \ddots & \vdots \\ K_{n,1} & \dots & K_{n,n} \end{bmatrix}\) is symmetric and positive semi-definite (non-negative definite).
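This condition can be checked numerically on a given sample set: build the kernel matrix and verify that its eigenvalues are non-negative (up to floating-point tolerance). A minimal sketch, with my own helper `kernel_matrix` and a Gaussian kernel with \(\epsilon = 1\):

```python
import numpy as np

def kernel_matrix(X, kernel):
    """Build the matrix K with K[i, j] = kernel(x_i, x_j) over the rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

gaussian = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)  # Gaussian kernel, eps = 1
K = kernel_matrix(X, gaussian)
# A valid kernel yields a positive semi-definite matrix on any sample set.
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # True
```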

Examples of Kernel functions

  1. Polynomial kernel function

    \[K(\mathbf{x}, \mathbf{y}) = ( \mathbf{x}^T\mathbf{y} +c )^d\]

    It can be shown that using this kernel is equivalent to first mapping the points into a higher-dimensional space and then computing the inner product there.

  2. Gaussian Kernel

    \[K(\mathbf{x}, \mathbf{y}) = \exp\{ -\frac{ \|\mathbf{x}-\mathbf{y}\|^2 }{2{\epsilon}^2} \}\]

    Applying the Gaussian kernel is equivalent to first mapping the points into an infinite-dimensional space and then computing the inner product there. This can be seen from the Taylor expansion of the exponential function. For a detailed explanation, see the (Chinese) post SVM中,高斯核为什么会把原始维度映射到无穷多维? ("In SVM, why does the Gaussian kernel map the original features to infinitely many dimensions?"). Both kernels are implemented in the short sketch after this list.
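Both kernels are one-liners in NumPy. A minimal sketch; the parameter names `c`, `d`, and `eps` (for \(\epsilon\)) and their default values are my own choices:

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    # K(x, y) = (x^T y + c)^d
    return (x @ y + c) ** d

def gaussian_kernel(x, y, eps=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * eps^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * eps ** 2))
```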

Dual problem with kernel function

With the definition of the kernel function, we can rewrite the dual problem and the classification rule as follows.

Dual Problem

$$ \max_{\lambda, \mu} \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$

Suppose that we have solved the dual problem and obtained the dual optimum. As before, let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) denote the support set associated with \(\mathbf{w}\), and \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) the support set associated with \(b\), with \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\) defined accordingly. Then we can compute the primal optimum:

$$ \mathbf{w}^\star = \sum_{i\in S_w}\lambda_i^\star y_i \phi(\mathbf{x}_i)\\ b^\star= y_j - {\mathbf{w}^\star}^T\phi(\mathbf{x}_j) = y_j - \sum_{i\in S_w}\lambda_i^\star y_i K(\mathbf{x}_i, \mathbf{x}_j) \ , \ j\in S_b\\ $$

Given a new point \(\mathbf{x}\), we classify it according to the sign of the decision value:

$$ \begin{align} \hat{y} &= {\mathbf{w}^\star}^T \phi(\mathbf{x}) + b^\star\\ &=\sum_{i\in S_w} \lambda^\star_i y_i K(\mathbf{x}_i, \mathbf{x}) + b^\star\\ \end{align} $$

Notice that \(\mathbf{w}^\star\) is never explicitly computed: it lives in the (possibly infinite-dimensional) mapped space, but both \(b^\star\) and the decision value are expressed entirely through kernel evaluations, so we never need it.
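Indeed, a prediction routine only needs kernel evaluations. A minimal sketch, assuming the dual optimum `lam_star` and a kernel function are given; as before, \(b^\star\) is averaged over \(S_b\) for stability, and the names and tolerance are my own:

```python
import numpy as np

def svm_predict(X_train, y_train, lam_star, C, kernel, X_new, tol=1e-8):
    """Classify new points using only kernel evaluations; w* is never formed."""
    S_w = lam_star > tol                             # support set for w
    S_b = (lam_star > tol) & (lam_star < C - tol)    # support set for b
    coef = lam_star[S_w] * y_train[S_w]              # lambda_i y_i for i in S_w

    # b* = y_j - sum_i lambda_i y_i K(x_i, x_j), averaged over j in S_b.
    K_b = np.array([[kernel(xi, xj) for xi in X_train[S_w]] for xj in X_train[S_b]])
    b_star = np.mean(y_train[S_b] - K_b @ coef)

    # Decision values for the new points, then take the sign.
    K_new = np.array([[kernel(xi, x) for xi in X_train[S_w]] for x in X_new])
    return np.sign(K_new @ coef + b_star)
```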

Solve the dual problem using Gradient Descent Algorithm

We can solve the dual problem using the gradient descent algorithm introduced in the post An Introduction to Support Vector Machines (SVM): Dual problem solution using GD. Simply select a kernel function, such as the polynomial or Gaussian kernel, compute the kernel matrix \(\mathbf{K}\) on the training dataset, compute the gradient of the dual objective, and perform the gradient updates to obtain the dual optimum \(\lambda^\star\). With \(\lambda^\star\), we can then compute the primal optimum \(b^\star\) and classify new points using the equations above.
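A minimal sketch of such a solver under simplifying assumptions: plain projected gradient ascent on \(\lambda\), where only the box constraint \(0 \leq \lambda_i \leq C\) is enforced by clipping; the equality constraint \(\sum_i \lambda_i y_i = 0\) is not handled here and would need the additional projection step from the earlier post. The learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def solve_dual_kernel(K, y, C, lr=1e-3, n_iter=5000):
    """Maximize the kernelized dual objective by projected gradient ascent.

    K: (n, n) kernel matrix, y: (n,) labels in {-1, +1}, C: box constraint.
    Only the box constraint 0 <= lambda_i <= C is enforced (by clipping);
    the equality constraint sum_i lambda_i y_i = 0 is not handled in this sketch.
    """
    n = len(y)
    lam = np.zeros(n)
    Q = (y[:, None] * y[None, :]) * K           # Q_ij = y_i y_j K_ij
    for _ in range(n_iter):
        grad = 1.0 - Q @ lam                    # gradient of sum(lam) - 0.5 lam^T Q lam
        lam = np.clip(lam + lr * grad, 0.0, C)  # ascent step, then project onto [0, C]
    return lam
```

The returned \(\lambda^\star\) can then be passed, together with the same kernel, to the prediction sketch above.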

In the next post, I will introduce how to solve the dual problem using Sequential Minimal Optimization (SMO).