An Introduction to Support Vector Machines (SVM): Dual problem solution using GD


Posted by Gu on May 27, 2019

Just to clarify, this content is mainly summarized from the course I took: “Fundamental of Big Data Analytics”, taught by Prof. Mathar Rudolf. For more information please visit: https://www.ti.rwth-aachen.de

Recall of the SVM primal problem and dual problem:
Primal Problem

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2\\ \begin{align} s.t.\ \ & y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq 1,\ i=1,\dots,n \end{align} $$

Dual Problem

$$ \max_{\lambda}\ \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\\ \begin{align} s.t.\ \ & \lambda_i \geq 0,\ i=1,\dots,n\\ & \sum_{i=1}^{n}\lambda_i y_i = 0 \end{align} $$

In the last post we introduced how to apply Lagrangian duality to SVM and how to recover the primal optimum once we have the dual optimum. In this post we focus on how to solve the dual problem and obtain the dual optimum.

Gradient Descent Algorithm for Dual Problem

To apply gradient descent (GD) to SVM, we need to reformulate the objective function of the dual problem. The new objective function is:

$$ \min_{\lambda}L(\lambda)=-\sum_{i=1}^{n}\lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 \\ s.t. \ \ \lambda_i\geq 0 $$

where \(c>0\) is the weighting factor for the constraint \(\sum_{i=1}^{n}\lambda_i y_i = 0\). The remaining constraint \(\lambda_i\geq 0\) can be enforced by clipping \(\lambda\) into the region \([0,\infty)\) after each gradient descent update.
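To make this concrete, here is a minimal NumPy sketch of the penalized objective and the clipping (projection) step. This is not the reference implementation; `X`, `y`, `lam`, and `c` are assumed to be the data matrix, the \(\pm 1\) label vector, the dual variables, and the penalty weight.

```python
import numpy as np

def dual_loss(lam, X, y, c):
    """Penalized dual objective L(lambda) for the linear-kernel case."""
    K = X @ X.T                              # K[i, j] = x_i^T x_j
    quad = 0.5 * (lam * y) @ K @ (lam * y)   # (1/2) sum_{i,j} lam_i lam_j y_i y_j x_i^T x_j
    penalty = 0.5 * c * (lam @ y) ** 2       # (c/2) (sum_i lam_i y_i)^2
    return -lam.sum() + quad + penalty

def project(lam):
    # Enforce lambda_i >= 0 by clipping into [0, inf) after each gradient step.
    return np.clip(lam, 0.0, None)
```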

Discussion: why not also put the constraint \(\lambda_i\geq 0\) into the loss function by introducing an extra hinge loss term? Then the final loss function would be: \(\min_{\lambda}L(\lambda)=-\sum_{i=1}^{n}\lambda_i + \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \frac{c}{2}(\sum_{i=1}^{n}\lambda_i y_i)^2 + d \sum_{i=1}^{n}\text{max}\{-\lambda_i,0\}\)

This is reasonable in theory but less feasible in practice. It introduces an extra hyperparameter \(d\), and we would be lost in endlessly fine-tuning and balancing the hyperparameters \(c\) and \(d\). Test results also show that enforcing the constraint \(\lambda_i\geq 0\) by clipping is efficient, and this method easily supports more general forms of SVM with penalty terms, which will be discussed later.

Based on the loss function, we can compute the gradient:

$$ \begin{align} \frac{\partial{L}}{\partial{\lambda_i}} &= -1 + y_i \sum_{j=1}^{n}\lambda_j y_j \mathbf{x}_i^T\mathbf{x}_j + {c}\sum_{j=1}^{n}\lambda_j y_j y_i \end{align} $$

We define a function \(K(\mathbf{x}_i, \mathbf{x}_j)= \mathbf{x}_i^T\mathbf{x}_j\). To maintain consistency with future posts, we call this function the kernel function. Given a training dataset \(\{\mathbf{x}_i\}, i=1,\dots,n\), we can build a kernel matrix:

$$ \mathbf{K} = \begin{bmatrix} K_{1,1} & \dots & K_{1,n}\\ \vdots & \ddots & \vdots \\ K_{n,1} & \dots & K_{n,n} \end{bmatrix} $$

where \(K_{i,j}=K(\mathbf{x}_i, \mathbf{x}_j)\).
Then the gradient \(\frac{\partial{L}}{\partial{\lambda_i}}\) can be expressed by the kernel matrix:

$$ \frac{\partial{L}}{\partial{\lambda_i}} = -1 + y_i \mathbf{e}_i^T \mathbf{K} ( \lambda \circ \mathbf{y} ) + c y_i \lambda ^T \mathbf{y} $$

where \(\mathbf{e}_i=[0,\dots,0,1,0,\dots,0]\), with the \(i^{th}\) element being 1 and all other elements being 0. The notation \(\lambda \circ \mathbf{y}\) denotes the element-wise multiplication of the two vectors \(\lambda\) and \(\mathbf{y}\).
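As a small illustration (assuming a NumPy array `X` of shape `(n, d)` storing the training points row by row), the kernel matrix and the per-component gradient above can be written as:

```python
import numpy as np

def linear_kernel_matrix(X):
    # K[i, j] = x_i^T x_j for all pairs of training points
    return X @ X.T

def grad_component(i, lam, K, y, c):
    # dL/dlambda_i = -1 + y_i * e_i^T K (lam ∘ y) + c * y_i * (lam^T y);
    # multiplying by e_i^T simply selects the i-th row of K.
    return -1.0 + y[i] * (K[i] @ (lam * y)) + c * y[i] * (lam @ y)
```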

We can also write the expression of the gradient of \(L\) with respect to the whole vector \(\lambda\):

$$ \frac{\partial{L}}{\partial{\lambda}} = -\mathbf{1}_n + (\mathbf{K}(\lambda \circ \mathbf{y}))\circ \mathbf{y} + c(\lambda^T\mathbf{y})\mathbf{y} $$

In practice, when implementing the gradient descent algorithm, we do not need to compute \(\mathbf{K}\) in each iteration, since \(\mathbf{K}\) does not depend on \(\lambda\). Instead, we can compute \(\mathbf{K}\) once before running gradient descent, store it in memory, and reuse it every time the gradient is computed.
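A vectorized sketch of this gradient (illustrative names, with \(\mathbf{K}\) passed in as the precomputed matrix):

```python
import numpy as np

def dual_gradient(lam, K, y, c):
    # dL/dlambda = -1_n + (K (lam ∘ y)) ∘ y + c (lam^T y) y
    return -np.ones_like(lam) + (K @ (lam * y)) * y + c * (lam @ y) * y

# K depends only on the data, so it is computed once before the descent loop,
# e.g. K = linear_kernel_matrix(X), and then reused in every iteration.
```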

Another implicit advantage of the kernel matrix expression is that it extends naturally to a broader form of SVM – SVM with kernels – where the kernel function \(K(\mathbf{x}_i, \mathbf{x}_j)\) can be given a more sophisticated definition than the plain dot product. Even in that case, the expression of the gradient remains the same: we simply pre-calculate the kernel matrix \(\mathbf{K}\) based on the new kernel function and then apply the gradient descent algorithm to find the optimal solution. We will discuss kernel SVM in future posts.
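As a purely illustrative preview (kernel SVM is treated in a later post), only the precomputation of \(\mathbf{K}\) would change; for example, with a Gaussian (RBF) kernel:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2); the gradient expressions above
    # stay exactly the same, only this precomputation is replaced.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)
```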

Implementation and Experiments

I implemented the gradient descent algorithm to compute the dual optimum and used it to solve the original SVM optimization problem. The code is available on my GitHub: SupportVectorMachine/gd-dual-svm.py. The evolution of the hyperplane over the iterations is shown in the figure Hyperplane Over Iteration.

Hyperplane Over Iteration

In the above figure, the points in solid color are the support vectors. As training goes on, more and more points are excluded from the support vector set; finally, only 3 support vectors remain. The final separating hyperplane is the maximum-margin separating hyperplane.
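For reference, the core of such a projected gradient descent loop might look roughly like the following. This is a self-contained sketch, not the code from the linked script; the step size, iteration count, and support-vector threshold are illustrative values.

```python
import numpy as np

def fit_dual_svm(X, y, c=10.0, lr=1e-3, n_iters=5000):
    """Projected gradient descent on the penalized dual objective (linear kernel)."""
    n = X.shape[0]
    K = X @ X.T                                    # precompute the kernel matrix once
    lam = np.zeros(n)
    for _ in range(n_iters):
        grad = -np.ones(n) + (K @ (lam * y)) * y + c * (lam @ y) * y
        lam = np.clip(lam - lr * grad, 0.0, None)  # gradient step + projection onto [0, inf)
    # Recover the primal solution as in the previous post:
    w = (lam * y) @ X                              # w = sum_i lam_i y_i x_i
    sv = lam > 1e-6                                # support vectors have lam_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                 # from y_i (w^T x_i + b) = 1 on support vectors
    return w, b, lam
```

A call such as `w, b, lam = fit_dual_svm(X, y)` then returns the hyperplane parameters together with the dual variables.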

Other Solutions?

One important feature of the gradient descent algorithm is that each iteration involves a matrix-vector multiplication \(\mathbf{K}(\lambda \circ \mathbf{y})\), with time complexity \(O(n^2)\). This can be computationally challenging when \(n\) is large.

Apart from the gradient descent method, there is another method called Sequential Minimal Optimization (SMO), which is a more efficient, specialized solver. We will discuss it in the following posts. Before going further, I would like to introduce SVM in more general settings.