Recall of the Slack SVM dual problem:
Dual Problem
Suppose that we have solved the dual problem and get the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related with \(\mathbf{w}\); \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) represent the support set related with \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:
Given a new point \(\mathbf{x}\), we can perform classification by computing:
According to the formulas above, we notice that in the dual problem, computation of \(\mathbf{w}^\star\) and classification of new points, \(\mathbf{x}_ i^T\mathbf{x}_ j\) always appears as a whole.
SVM with kernel functions
Mapping points to a higher dimensional space
In some cases, if the points is not linearly separable in current space, they are possibly linearly separable if we map them into the higher dimension.
We define \(\phi(\mathbf{x}): R^p \rightarrow R^d\ ,\ d>p\) as a mapping function which maps low dimensional data to a high dimensional data. We can first map our data \(\mathbf{x}_ i \rightarrow \phi(\mathbf{x}_ i)\), then solve the dual problem:
We notice that in the dual problem, computing \(\mathbf{w}^\star\) and performing classification, \(\phi(\mathbf{x}_ i)^T\phi(\mathbf{x}_ j)\) always appears as a whole. Therefore, we can avoid computing the exact form of \(\phi(\mathbf{x})\), but instead directly explore the function for the inner product of two mapped points \(K: R^p \times R^p \rightarrow R\):
We call \(K(\mathbf{x}_i, \mathbf{x}_j)\) as the kernel function.
What is a valid kernel function?
A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if there exists a mapping function \(\phi\), such that it holds \(K_{i,j} = <\phi(\mathbf{x}_ i), \phi(\mathbf{x}_ j)>\) for any \(\mathbf{x}_ i, \mathbf{x}_ j\in R^p\).
Moreover, there is an equivalent conclusion on the validness of a kernel function.
A kernel function \(K(\mathbf{x}_ i, \mathbf{x}_ j)\) is valid if for any \(n\) samples \(\{ \mathbf{x}_ i \vert \mathbf{x}_ i \in R^p \}, i=1,\dots, n\), the kernel matrix \(\mathbf{K}=\begin{bmatrix}K_{1,1}, \dots, K_{1,n}\\\dots \\ K_{n,1},\dots, K_{n,n} \end{bmatrix}\) is non-negative definite.
Examples of Kernel functions
-
Polynomial kernel function
\[K(\mathbf{x}, \mathbf{y}) = ( \mathbf{x}^T\mathbf{y} +c )^d\]It can be proven that this function is equivalent to first mapping points to higher dimensional space and then computing the inner product.
-
Gaussian Kernel
\[K(\mathbf{x}, \mathbf{y}) = \exp\{ -\frac{ \|\mathbf{x}-\mathbf{y}\|^2 }{2{\epsilon}^2} \}\]Applying Gaussian kernel is equivalent to first mapping points to a infinitely high dimensional space and then computing the inner product. This can be understood by the Taylor expansion of the exponential function. For detailed explanation please see SVM中,高斯核为什么会把原始维度映射到无穷多维?
Dual problem with kernel function
With the definition of the kernel function, we can rewrite the dual problem and classification task as following.
Dual Problem
Suppose that we have solved the dual problem and get the dual optimum. Let \(S_w=\{ i \vert 0<\lambda_i^\star \leq C \}\) represent the support set related with \(\mathbf{w}\); \(S_b=\{ i \vert 0<\lambda_i^\star < C \}\) represent the support set related with \(b\). Meanwhile, we define \(S_b^+ =\{ i \vert i\in S_b \ \text{and}\ y_i = +1 \}\) and \(S_b^-=\{ i \vert i\in S_b\ \text{and}\ y_i = -1 \}\). Then we can compute the primal optimum:
Given a new point \(\mathbf{x}\), we can perform classification by computing:
See, in fact \(\mathbf{w}^\star\) is never really computed, since we are only interested in the kernel function!
Solve the dual problem using Gradient Descent Algorithm
We can solve the dual problem using gradient descent algorithm as introduced in the post An Introduction to Support Vector Machines (SVM): Dual problem solution using GD. Just simply select a kernel function, such as polynomial or Gaussian, compte the Kernel matrix \(\mathbf{K}\) for the training dataset, compute the gradient and then perform back propagation to get the dual optimum \(\lambda^\star\). After getting \(\lambda^\star\), we can compute the primal optimum \(b^\star\) and perform classification on new points using the equations above.
In the next post, I will introduce how to solve the dual problem using Sequential Minimal Optimization (SMO).
-
Previous
An Introduction to Support Vector Machines (SVM): SVM with slack variables -
Next
An Introduction to Support Vector Machines (SVM): Sequential Minimal Optimization (SMO)