Just to clarify, these contents are mainly summarized from the course I took: “Fundamental of Big Data Analytics”, taught by Prof. Mathar Rudolf. For for information please visit: https://www.ti.rwth-aachen.de
In the last post we have conquered how to use gradient descent algorithm to train a SVM. So,
is this the end of the story?
Not really. Although using GD can solve the SVM optimization, GD has some shortcomings:
- Gradient procedure is time consuming and the solution may be suboptimal.
- GD method cannot explicitly identify support vectors (points) which determine the hyperplane.
To overcome these shortcomings, we can take advantage of the Lagrangian duality. First we convert original SVM optimization problem into a primal (convex) optimization problem, then we can get the Lagrangian dual problem. Luckily we can solve the dual problem based on KKT condition using more efficient methods.
First of all, we need to briefly introduce Lagrangian duality and Karush-Kuhn-Tucker (KKT) condition.
Lagrangian Duality Principle
Primal Problem
A primal convex optimization problem has the following expression:
where $f_i(\mathbf{x}) _{(i=0,1,\dots,n)}$ are convex, and $h_j(\mathbf{x}) _{(j=1,\dots,p)}$ are linear (or affine).
We can get the Lagrangian function:
Since $f_i(\mathbf{x})$ are convex, and $h_j(\mathbf{x})$ are linear, \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\) is also convex w.r.t \(\mathbf{x}\). Therefore, we can get the infimum of \(L(\mathbf{x}, \mathbf{\lambda}, \mathbf{\mu})\), which is called the Lagrangian dual function:
The difference between minimum and infimum:
- \(\min(S)\) means the smallest element in set \(S\);
- \(inf(S)\) means the largest value which is less than or equal to any element in \(S\).
- In the case where the minimum value is reachable, infimum = minimum. e.g. \(S=\{\text{all natural number}\}\), then \(\inf(S) = \min(S) = 0\)
- In the case where the minimum is not reachable, infimum may still exist. e.g. \(S=\{f(x)\vert f(x)=1/x, x>0\}\), \(\inf(S)=0\)
Dual Problem Based on the dual function we can get the dual optimization problem:
Strong Duality and Slater’s Condition
Let $f_0^\star(x)$ and $g^\star(\mathbf{\lambda},\mathbf{\mu})$ be the primal optimum and dual optimum respectively.
Weak duality means that
\(g^\star(\mathbf{\lambda},\mathbf{\mu}) \leq f_0^\star(x)\)
The difference \(f_0^\star(x)-g^\star(\mathbf{\lambda},\mathbf{\mu})\) is called duality gap.
Under certain circumstances, the duality gap can be 0, which means the strong duality holds. This condition is called Slater’s condition:
- Apart from the constraints in primal problem, Slater’s condition requires that the constraints \(f_i(\mathbf{x}) _ {(i=1,\dots,n)}\) are linear (or affine).
If Slater’s condition is satisfied, strong duality holds, and furthermore for the optimal value \(\mathbf{x}^\star\), \(\mathbf{\lambda}^\star\) and \(\mathbf{\mu}^\star\), the Karush-Kuhn-Tucker (KKT) conditions also holds.
Karush-Kuhn-Tucker (KKT) Conditions
KKT conditions contain four conditions:
- primal constraints
$$f_i(\mathbf{x}^\star)\leq 0, \ i=1,\dots,n \\ h_j(\mathbf{x}^\star)=0, \ j=1,\dots,p$$ - compute the infimum of \(L\) w.r.t \(\mathbf{x}\)
$$\Delta_{\mathbf{x}} L(\mathbf{x}^\star, \mathbf{\lambda}^\star, \mathbf{\mu}^\star) = 0$$ - dual constraints
$$ \lambda_i^\star\geq 0, \ i=1,\dots,n $$ - Complementary Slackness
$$ \lambda_i^\star f_i(\mathbf{x}^\star) = 0, \ i=1,\dots,n $$
Therefore, if strong duality holds, we can first solve the dual problem and get the optimal \(\mathbf{\lambda}^\star\), \(\mathbf{\mu}^\star\). Then we can substitute the dual optimum into the KKT conditions (especially KKT condition 2) to get the primal optimum \(\mathbf{x}^\star\). Then the primal convex optimization problem can be solved.
Apply Lagrangian Duality to SVM
Now we are able to solve the SVM optimization problem using Lagrangian duality. As introduced in the first post An Introduction to Support Vector Machines (SVM): Basics, the SVM optimization problem is:
The Lagrangian dual function is
To compute the Lagrangian dual function, we can compute the partial derivative of $L$ w.r.t \(\mathbf{w},b\) and set them to 0 (see KKT condition 2)
Then we get
Substitute these two constraint equations into \(L(\mathbf{w},b,\mathbf{\lambda})\), we get the Lagrangian dual function:
Then the dual problem is:
We can solve this dual problem using Gradient descent algorithm or Sequential Minimal Optimization (SMO). This will be discussed in the next post.
Once we get the dual optimum \(\lambda^\star\), we can get the primal optimum \(\mathbf{w}^\star=\sum_{i=1}^{n} \lambda_i^\star y_i\mathbf{x}_ i\). But wait, how to get the optimal \(b^\star\)? To further understand this, we need analyze the KKT conditions for SVM optimization problem.
KKT conditions for SVM
Since the primal constraints \(1-y_i(\mathbf{w}^T\mathbf{x}_ i+b)\leq 0\) is obviously linear, so the Slater’s condition holds, strong duality holds, and the KKT conditions are satisfied for the primal optimum and dual optimum of the SVM. Therefore, we have the complementary slackness:
This looks interesting. From dual constraints we know that \(\lambda^\star\geq 0\). Together with this complementary slackness, we will know that if \(\lambda_i>0\), then it must hold \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\). This means \(\mathbf{x}_i\) is exactly one of the support vectors (the points which have a margin distance to the separating hyperplane)!
Therefore, we find a way to identify support vectors using Lagrangian duality:
- Compute the dual optimum, if \(\lambda_i^\star>0\), then \(\mathbf{x}_ i\) is a support vector.
Let \(S=\{i\vert \lambda^\star_i > 0\}\) represent the support vector set, \(S_+=\{i\vert i\in S\ \text{and}\ y_i=+1\}\) represent the subset whose labels are \(+1\), and \(S_-=\{i\vert i\in S\ \text{and}\ y_i=-1 \}\) represent the subset whose labels are -1. Then the primal optimum will be:
Since we know for support vectors \(\mathbf{x}_i,\ i\in S\), it holds \(y_i({\mathbf{w}^\star}^T\mathbf{x}_i+b^\star)=1\). \(y_i \in \{-1,+1\}\), so we get \({\mathbf{w}^\star}^T\mathbf{x}_i + b^\star= y_i\). Therefore, the primal optimum of \(b\) is:
or
In practice, in order to avoid influence of noise, we may use a more stable way to compute \(b^\star\):
Use SVM for Classification
Given a new point \(\mathbf{x}\), we can compute the value \({\mathbf{w}^\star}^T\mathbf{x}+b^\star\), and predict the label \(\hat{y}\) using hard decision or soft decision as shown in An Introduction to Support Vector Machines (SVM): Gradient Descent Solution. Substitute the expression of \({\mathbf{w}^\star}\), we have:
This implies that we only need the support vectors to determine the separating hyperplane and classify new points. Furthermore, we notice that either in the dual problem or in the classification, \(\mathbf{x}_i^T\mathbf{x}_j\) always appears as a whole. This feature can be used for Kernel SVM, which will be discussed in the following posts.
In the next post I will introduce how to solve the dual problem.
-
Previous
An Introduction to Support Vector Machines (SVM): Gradient Descent Solution -
Next
An Introduction to Support Vector Machines (SVM): Dual problem solution using GD