An Introduction to Support Vector Machines (SVM): Convex Optimization and Lagrangian Duality Principle

An Overview of Support Vector Machines (SVM): Convex Optimization and the Lagrangian Dual Problem

Posted by Gu on May 25, 2019

Just to clarify, these contents are mainly summarized from the course I took: “Fundamentals of Big Data Analytics”, taught by Prof. Rudolf Mathar. For more information please visit: https://www.ti.rwth-aachen.de

In the last post we learned how to use the gradient descent algorithm to train an SVM. So,

is this the end of the story?

Not really. Although gradient descent (GD) can solve the SVM optimization problem, it has some shortcomings:

  • The gradient descent procedure is time-consuming, and the solution may be suboptimal.
  • The GD method cannot explicitly identify the support vectors, i.e. the points that determine the separating hyperplane.

To overcome these shortcomings, we can take advantage of Lagrangian duality. We first write the original SVM optimization problem as a primal (convex) optimization problem, and from it derive the Lagrangian dual problem. Luckily, the dual problem can be solved with more efficient methods, and the KKT conditions let us recover the primal solution from the dual one.

First of all, we need to briefly introduce Lagrangian duality and the Karush-Kuhn-Tucker (KKT) conditions.

Lagrangian Duality Principle

Primal Problem
A primal convex optimization problem has the following expression:

$$ \min_{\mathbf{x}} f_0(\mathbf{x})\\ s.t. \ \ f_i(\mathbf{x}) \leq 0, \ i=1,\dots,n \\ \ \ \ \ \ \ \ h_j(\mathbf{x}) = 0, \ j=1,\dots,p $$

where $f_i(\mathbf{x}) _{(i=0,1,\dots,n)}$ are convex, and $h_j(\mathbf{x}) _{(j=1,\dots,p)}$ are linear (or affine).

We can get the Lagrangian function:

$$ L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu}) = f_0(\mathbf{x}) + \sum_{i=1}^{n}\lambda_{i}f_i(\mathbf{x}) + \sum_{j=1}^{p}\mu_jh_j(\mathbf{x}) $$

Since $f_i(\mathbf{x})$ are convex and $h_j(\mathbf{x})$ are affine, \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\) is also convex w.r.t. \(\mathbf{x}\) for any fixed \(\mathbf{\lambda} \geq 0\). Taking the infimum of \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\) over \(\mathbf{x}\) gives the Lagrangian dual function:

$$ g(\mathbf{\lambda},\mathbf{\mu})= \inf_\mathbf{x} \ L(\mathbf{x},\mathbf{\lambda},\mathbf{\mu}) $$

The difference between minimum and infimum:

  • \(\min(S)\) means the smallest element in the set \(S\);
  • \(\inf(S)\) means the greatest lower bound of \(S\), i.e. the largest value that is less than or equal to every element of \(S\).
  • In the case where the minimum is attained, infimum = minimum. e.g. if \(S\) is the set of all natural numbers (including 0), then \(\inf(S) = \min(S) = 0\).
  • In the case where the minimum is not attained, the infimum may still exist. e.g. for \(S=\{f(x)\vert f(x)=1/x,\ x>0\}\), \(\inf(S)=0\) but \(\min(S)\) does not exist.
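
As a small running example (my own illustration, not from the course notes), consider minimizing \(x^2\) subject to \(x \geq 1\):

$$ \min_{x} \ x^2 \\ s.t. \ \ 1 - x \leq 0 $$

Its Lagrangian is \(L(x,\lambda) = x^2 + \lambda(1-x)\). For a fixed \(\lambda\), the infimum over \(x\) is attained at \(x = \lambda/2\), which gives the dual function

$$ g(\lambda) = \inf_{x} \left( x^2 + \lambda(1-x) \right) = \lambda - \frac{\lambda^2}{4} $$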

Dual Problem
Based on the dual function, we can get the dual optimization problem:

$$ \max_{\mathbf{\lambda},\mathbf{\mu}}\ g(\mathbf{\lambda},\mathbf{\mu})\\ s.t. \ \ \lambda_i \geq 0, \ \ i=1,\dots,n\\ \small\text{and other constraints introduced by computing the dual function} $$
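
For the running example, the dual problem is

$$ \max_{\lambda}\ \lambda - \frac{\lambda^2}{4} \\ s.t. \ \ \lambda \geq 0 $$

Its optimum is attained at \(\lambda^\star = 2\) with value \(g^\star = 1\), which equals the primal optimum \(f_0^\star = 1\) attained at \(x^\star = 1\). This zero gap is an instance of the strong duality discussed next.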

Strong Duality and Slater’s Condition
Let \(f_0^\star\) and \(g^\star\) be the primal optimum and the dual optimum, respectively. Weak duality means that \(g^\star \leq f_0^\star\) always holds. The difference \(f_0^\star-g^\star\) is called the duality gap.

Under certain circumstances, the duality gap can be 0, which means that strong duality holds. A sufficient condition for this is Slater’s condition:

  • Apart from requiring the primal constraints to be feasible, Slater’s condition requires a strictly feasible point, i.e. some \(\mathbf{x}\) with \(f_i(\mathbf{x}) < 0\) for all \(i=1,\dots,n\) and \(h_j(\mathbf{x}) = 0\) for all \(j=1,\dots,p\). For inequality constraints \(f_i\) that are linear (affine), strict feasibility is not needed; \(f_i(\mathbf{x}) \leq 0\) is enough.

If Slater’s condition is satisfied, strong duality holds; furthermore, the Karush-Kuhn-Tucker (KKT) conditions hold at the optimal values \(\mathbf{x}^\star\), \(\mathbf{\lambda}^\star\) and \(\mathbf{\mu}^\star\).
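
Back to the running example: \(x = 2\) is strictly feasible because \(f_1(2) = 1 - 2 = -1 < 0\), so Slater’s condition is satisfied, which explains the zero duality gap \(f_0^\star = g^\star = 1\) found above.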

Karush-Kuhn-Tucker (KKT) Conditions
The KKT conditions consist of four parts:

  1. primal constraints
    $$f_i(\mathbf{x}^\star)\leq 0, \ i=1,\dots,n \\ h_j(\mathbf{x}^\star)=0, \ j=1,\dots,p$$
  2. stationarity (the infimum of \(L\) w.r.t. \(\mathbf{x}\) is attained at \(\mathbf{x}^\star\), so the gradient vanishes)
    $$\nabla_{\mathbf{x}} L(\mathbf{x}^\star, \mathbf{\lambda}^\star, \mathbf{\mu}^\star) = 0$$
  3. dual constraints
    $$ \lambda_i^\star\geq 0, \ i=1,\dots,n $$
  4. Complementary Slackness
    $$ \lambda_i^\star f_i(\mathbf{x}^\star) = 0, \ i=1,\dots,n $$

Therefore, if strong duality holds, we can first solve the dual problem to get the optimal \(\mathbf{\lambda}^\star\) and \(\mathbf{\mu}^\star\), and then substitute them into the KKT conditions (especially the stationarity condition 2) to recover the primal optimum \(\mathbf{x}^\star\). In this way the primal convex optimization problem is solved.
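
As a quick check, the running example satisfies all four KKT conditions at \(x^\star = 1\), \(\lambda^\star = 2\):

$$ 1 - x^\star = 0 \leq 0, \quad 2x^\star - \lambda^\star = 0, \quad \lambda^\star = 2 \geq 0, \quad \lambda^\star(1 - x^\star) = 0 $$

Conversely, given only the dual optimum \(\lambda^\star = 2\), the stationarity condition \(2x^\star - \lambda^\star = 0\) recovers the primal optimum \(x^\star = 1\). This is exactly the procedure we will follow for the SVM.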

Apply Lagrangian Duality to SVM

Now we are able to solve the SVM optimization problem using Lagrangian duality. As introduced in the first post An Introduction to Support Vector Machines (SVM): Basics, the SVM optimization problem is:

$$ \min_{\mathbf{w},b}\frac{1}{2}\|\mathbf{w}\|^2\\ s.t. \ \ y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1, \ \ i=1,\dots,n $$

The Lagrangian function is

$$ L(\mathbf{w}, b , \mathbf{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{n}\lambda_i(1-y_i(\mathbf{w}^T\mathbf{x}_i+b)) $$

To compute the Lagrangian dual function, we take the partial derivatives of $L$ w.r.t. \(\mathbf{w}\) and \(b\) and set them to 0 (cf. KKT condition 2):

$$ \frac{\partial{L}}{\partial{\mathbf{w}}} = \mathbf{w} - \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i = 0\\ \frac{\partial{L}}{\partial{b}} = -\sum_{i=1}^{n}\lambda_i y_i =0 $$

Then we get

$$ \mathbf{w}^\star = \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i\\ \sum_{i=1}^{n}\lambda_i y_i = 0 $$

Substituting these two equations back into \(L(\mathbf{w},b,\mathbf{\lambda})\), we get the Lagrangian dual function:

$$ \begin{align} g(\mathbf{\lambda}) & = \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \sum_{i=1}^{n}\lambda_i(1-y_i( \sum_{j=1}^{n}\lambda_j y_j \mathbf{x_j}^T\mathbf{x}_i +b ))\\ & = \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j - \sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j + \sum_{i=1}^{n}\lambda_i - (\sum_{i=1}^{n}\lambda_i y_i)b \\ &= \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \end{align} $$

Then the dual problem is:

$$ \max_{\lambda} \ \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j\\ \begin{align} s.t. \ \ &\lambda_i \geq 0, \ \ i=1,\dots,n\\ &\sum_{i=1}^{n} \lambda_i y_i = 0 \end{align} $$

We can solve this dual problem using a gradient-based algorithm or Sequential Minimal Optimization (SMO); this will be discussed in the next post.
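
Before that, as a quick sanity check, the dual can also be handed to a generic constrained solver. The sketch below (my own illustration, not course code) uses SciPy’s SLSQP method on a tiny made-up dataset; the names `X`, `y` and `lam_star` are assumptions for this example only.

```python
# A minimal sketch: solve the SVM dual with an off-the-shelf constrained solver.
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable 2D dataset, two points per class (made up for illustration).
X = np.array([[2.0, 2.0], [4.0, 1.0], [-2.0, -2.0], [-4.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q[i, j] = y_i y_j x_i^T x_j, so that g(lam) = sum(lam) - 0.5 * lam^T Q lam
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(lam):
    # SLSQP minimizes, so we minimize -g(lam).
    return 0.5 * lam @ Q @ lam - lam.sum()

def neg_dual_grad(lam):
    return Q @ lam - np.ones(n)

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    jac=neg_dual_grad,
    bounds=[(0.0, None)] * n,                                   # lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],   # sum_i lambda_i y_i = 0
    method="SLSQP",
)
lam_star = res.x
print("lambda* =", np.round(lam_star, 4))
print("support vector indices:", np.where(lam_star > 1e-6)[0])
```

For this toy set, only the two points nearest the separating hyperplane should come out with \(\lambda_i^\star > 0\); these are exactly the support vectors discussed below.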

Once we get the dual optimum \(\lambda^\star\), we can get the primal optimum \(\mathbf{w}^\star=\sum_{i=1}^{n} \lambda_i^\star y_i\mathbf{x}_ i\). But wait, how do we get the optimal \(b^\star\)? To answer this, we need to analyze the KKT conditions for the SVM optimization problem.

KKT conditions for SVM

Since the primal constraints \(1-y_i(\mathbf{w}^T\mathbf{x}_ i+b)\leq 0\) are affine in \((\mathbf{w}, b)\), Slater’s condition holds (for affine constraints feasibility is enough), so strong duality holds and the KKT conditions are satisfied at the primal and dual optima of the SVM. In particular, we have complementary slackness:

$$ \lambda_i^\star (1-y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star))=0, \ \ i=1,\dots,n $$

This looks interesting. From the dual constraints we know that \(\lambda_i^\star\geq 0\). Combined with complementary slackness, this tells us that if \(\lambda_i^\star>0\), then \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\) must hold. This means \(\mathbf{x}_i\) is exactly one of the support vectors (the points that lie exactly at margin distance from the separating hyperplane)!

Therefore, we find a way to identify support vectors using Lagrangian duality:

  • Compute the dual optimum, if \(\lambda_i^\star>0\), then \(\mathbf{x}_ i\) is a support vector.

Let \(S=\{i\vert \lambda^\star_i > 0\}\) be the index set of the support vectors, \(S_+=\{i\vert i\in S\ \text{and}\ y_i=+1\}\) the subset with label \(+1\), and \(S_-=\{i\vert i\in S\ \text{and}\ y_i=-1 \}\) the subset with label \(-1\). Then the primal optimum is:

$$ \mathbf{w}^\star = \sum_{i\in S} \lambda_i^\star y_i \mathbf{x}_i\\ $$

For support vectors \(\mathbf{x}_i,\ i\in S\), we know that \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\). Since \(y_i \in \{-1,+1\}\) (so that \(1/y_i = y_i\)), this gives \({\mathbf{w}^\star}^T\mathbf{x}_i + b^\star= y_i\). Therefore, the primal optimum of \(b\) is:

$$ b^\star = y_i - {\mathbf{w}^\star}^T\mathbf{x}_i, \ \ i\in S $$

or

$$ b^\star = -\frac{1}{2}({\mathbf{w}^\star}^T\mathbf{x}_i + {\mathbf{w}^\star}^T\mathbf{x}_j ), \ \ i\in S_+,\ j \in S_- $$

In practice, in order to reduce the influence of noise and numerical error, a more stable way to compute \(b^\star\) is to average over all support vectors:

$$ b^\star = \frac{1}{\vert S\vert} \sum_{i\in S} \left( y_i - {\mathbf{w}^\star}^T\mathbf{x}_i \right) $$
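
To make the recovery step concrete, here is a minimal, self-contained sketch (mine, not from the course) applying the formulas above to a two-point toy set whose dual optimum \(\lambda^\star=(0.25, 0.25)\) can be worked out by hand; with real data, the \(\lambda^\star\) returned by a solver (e.g. the sketch above) would be used instead.

```python
# A minimal sketch of recovering w* and b* from a dual optimum.
import numpy as np

# Hand-worked toy set: one positive and one negative point, dual optimum (0.25, 0.25).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
lam_star = np.array([0.25, 0.25])

# Support vector indices: lambda_i > 0 (use a small tolerance numerically).
S = np.where(lam_star > 1e-6)[0]

# w* = sum_{i in S} lambda_i* y_i x_i
w_star = ((lam_star[S] * y[S])[:, None] * X[S]).sum(axis=0)

# b* averaged over all support vectors: b* = (1/|S|) sum_{i in S} (y_i - w*^T x_i)
b_star = np.mean(y[S] - X[S] @ w_star)

print("w* =", w_star)                          # (0.5, 0.5) for this toy set
print("b* =", b_star)                          # 0.0 for this toy set
print("margins:", y * (X @ w_star + b_star))   # all >= 1, exactly 1 for support vectors
```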

Use SVM for Classification

Given a new point \(\mathbf{x}\), we compute the value \({\mathbf{w}^\star}^T\mathbf{x}+b^\star\) and predict the label \(\hat{y}\) using a hard or soft decision, as shown in An Introduction to Support Vector Machines (SVM): Gradient Descent Solution. Substituting the expression for \({\mathbf{w}^\star}\), we have:

$$ {\mathbf{w}^\star}^T\mathbf{x} + b^\star = \sum_{i\in S}\lambda_i^\star y_i \mathbf{x}_i^T\mathbf{x} +b^\star $$

This implies that only the support vectors are needed to determine the separating hyperplane and to classify new points. Furthermore, notice that both in the dual problem and in the classification rule, the training points appear only through the inner products \(\mathbf{x}_i^T\mathbf{x}_j\). This property is exactly what Kernel SVM exploits, and it will be discussed in the following posts.
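
To make this explicit, here is a small sketch (my own naming, not from the post) of the decision rule with the inner product factored out into a kernel function; swapping `linear_kernel` for another kernel is precisely the step Kernel SVM will take.

```python
# A sketch of the classification rule in which training points enter only via inner products.
import numpy as np

def linear_kernel(a, b):
    return a @ b

def decision_value(x_new, X_sv, y_sv, lam_sv, b_star, kernel=linear_kernel):
    # w*^T x + b* = sum_{i in S} lambda_i* y_i k(x_i, x) + b*
    return sum(l * yi * kernel(xi, x_new)
               for l, yi, xi in zip(lam_sv, y_sv, X_sv)) + b_star

# Support vectors, multipliers and bias from the two-point toy example above.
X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
lam_sv = np.array([0.25, 0.25])
b_star = 0.0

x_new = np.array([2.0, 0.5])
value = decision_value(x_new, X_sv, y_sv, lam_sv, b_star)
print("decision value:", value)
print("predicted label:", 1 if value >= 0 else -1)
```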

In the next post I will introduce how to solve the dual problem.