An Introduction to Support Vector Machines (SVM): Convex Optimization and Lagrangian Duality Principle


In the last post we covered how to use the gradient descent (GD) algorithm to train an SVM. Although GD can solve the SVM optimization problem, it has some shortcomings:

  • The gradient descent procedure is time-consuming, and the solution may be suboptimal.
  • GD cannot explicitly identify the support vectors, i.e. the points that determine the separating hyperplane.

To overcome these shortcomings, we can take advantage of Lagrangian duality. First we write the original SVM optimization problem as a primal (convex) optimization problem, and from it we derive the Lagrangian dual problem. Luckily, we can solve the dual problem based on the KKT conditions using more efficient methods.

First of all, we need to briefly introduce Lagrangian duality and the Karush-Kuhn-Tucker (KKT) conditions.

Lagrangian Duality Principle

Primal Problem
A primal convex optimization problem has the following expression:

$$\min_x f_0(x)\quad \text{s.t.}\quad f_i(x)\le 0,\ i=1,\dots,n; \qquad h_j(x)=0,\ j=1,\dots,p$$

where $f_i(x)\ (i=0,1,\dots,n)$ are convex, and $h_j(x)\ (j=1,\dots,p)$ are linear (or affine).

  • The constraint that $f_i(x)\ (i=0,1,\dots,n)$ are convex defines a convex region.
  • The constraint that $h_j(x)\ (j=1,\dots,p)$ are linear confines the region to the intersection of multiple hyperplanes (potentially reducing the dimensionality).

We can get the Lagrangian function:

$$L(x,\lambda,\mu)=f_0(x)+\sum_{i=1}^{n}\lambda_i f_i(x)+\sum_{j=1}^{p}\mu_j h_j(x)$$

Since $f_i(x)$ are convex and $h_j(x)$ are linear, $L(x,\lambda,\mu)$ is also convex w.r.t. $x$ (for $\lambda_i\ge 0$). Therefore, we can take the infimum of $L(x,\lambda,\mu)$ over $x$, which is called the Lagrangian dual function:

$$g(\lambda,\mu)=\inf_x\, L(x,\lambda,\mu)$$

The difference between minimum and infimum:

  • $\min(S)$ means the smallest element in the set $S$;
  • $\inf(S)$ means the greatest lower bound of $S$, i.e. the largest value that is less than or equal to every element of $S$;
  • when the minimum is attained, the infimum equals the minimum, e.g. $S=\{\text{all natural numbers}\}$ gives $\inf(S)=\min(S)=0$;
  • when the minimum is not attained, the infimum may still exist, e.g. $S=\{f(x)\mid f(x)=1/x,\ x>0\}$ gives $\inf(S)=0$.

Dual Problem
Based on the dual function we can get the dual optimization problem:

$$\max_{\lambda,\mu}\ g(\lambda,\mu)\quad \text{s.t.}\quad \lambda_i\ge 0,\ i=1,\dots,n,$$

plus any other constraints introduced by computing the dual function.
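
As a tiny worked example (a problem of my own choosing, not from the earlier posts), take the one-dimensional problem $\min_x x^2$ subject to $1-x\le 0$. The whole pipeline looks like this:

$$L(x,\lambda)=x^2+\lambda(1-x),\qquad g(\lambda)=\inf_x L(x,\lambda)=\lambda-\frac{\lambda^2}{4},\qquad \max_{\lambda\ge 0}\ g(\lambda)\ \text{is attained at}\ \lambda^*=2.$$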

Strong Duality and Slater’s Condition
Let $f_0(x^*)$ and $g(\lambda^*,\mu^*)$ be the primal optimum and the dual optimum, respectively. Weak duality means that $g(\lambda^*,\mu^*)\le f_0(x^*)$. The difference $f_0(x^*)-g(\lambda^*,\mu^*)$ is called the duality gap.

Under certain circumstances the duality gap is 0, which means strong duality holds. A sufficient condition for strong duality is Slater’s condition:

  • Apart from the constraints in the primal problem being satisfiable, Slater’s condition requires that there exists an $x$ at which all inequality constraints hold strictly, i.e. $f_i(x)<0$ for $i=1,\dots,n$. When the constraints $f_i(x)$ are linear (or affine), this requirement relaxes: a feasible $x$ is enough, and strict inequality is not needed.

If Slater’s condition is satisfied, strong duality holds; furthermore, at the primal optimum $x^*$ and the dual optimum $\lambda^*, \mu^*$, the Karush-Kuhn-Tucker (KKT) conditions also hold.

Karush-Kuhn-Tucker (KKT) Conditions
There are four KKT conditions:

  1. Primal constraints
    $f_i(x^*)\le 0,\ i=1,\dots,n$ and $h_j(x^*)=0,\ j=1,\dots,p$
  2. Dual constraints
    $\lambda_i^*\ge 0,\ i=1,\dots,n$
  3. Stationarity (compute the infimum of $L$ w.r.t. $x$)
    $\nabla_x L(x^*,\lambda^*,\mu^*)=0$
  4. Complementary slackness
    $\lambda_i^* f_i(x^*)=0,\ i=1,\dots,n$

Therefore, if strong duality holds, we can first solve the dual problem to get the optimal $\lambda^*$ and $\mu^*$. Then we substitute this dual optimum into the KKT conditions (especially the stationarity condition) to recover the primal optimum $x^*$. The primal convex optimization problem is then solved.
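
To make this recipe concrete, return to the toy example above: its dual optimum is $\lambda^*=2$, and the stationarity condition recovers the primal optimum directly:

$$\nabla_x L(x,\lambda^*)=2x-\lambda^*=0\ \Rightarrow\ x^*=\frac{\lambda^*}{2}=1,$$

and indeed $\lambda^*\,(1-x^*)=0$ (complementary slackness) with $f_0(x^*)=g(\lambda^*)=1$, so the duality gap is zero.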

Apply Lagrangian Duality to SVM

Now we are able to solve the SVM optimization problem using Lagrangian duality. As introduced in the first post An Introduction to Support Vector Machines (SVM): Basics, the SVM optimization problem is:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2\quad \text{s.t.}\quad y_i(w^T x_i+b)\ge 1,\ i=1,\dots,n$$

The Lagrangian function is

$$L(w,b,\lambda)=\frac{1}{2}\|w\|^2+\sum_{i=1}^{n}\lambda_i\bigl(1-y_i(w^T x_i+b)\bigr)$$

To compute the Lagrangian dual function, we compute the partial derivatives of $L$ w.r.t. $w$ and $b$ and set them to 0 (see the stationarity condition):

$$\frac{\partial L}{\partial w}=w-\sum_{i=1}^{n}\lambda_i y_i x_i=0,\qquad \frac{\partial L}{\partial b}=-\sum_{i=1}^{n}\lambda_i y_i=0$$

Then we get

$$w=\sum_{i=1}^{n}\lambda_i y_i x_i,\qquad \sum_{i=1}^{n}\lambda_i y_i=0$$

Substituting these two equations into $L(w,b,\lambda)$, we get the Lagrangian dual function:

$$\begin{aligned}
g(\lambda)&=\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j+\sum_{i=1}^{n}\lambda_i\Bigl(1-y_i\bigl(\sum_{j=1}^{n}\lambda_j y_j x_j^T x_i+b\bigr)\Bigr)\\
&=\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j-\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j+\sum_{i=1}^{n}\lambda_i-\Bigl(\sum_{i=1}^{n}\lambda_i y_i\Bigr)b\\
&=\sum_{i=1}^{n}\lambda_i-\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j
\end{aligned}$$

Then the dual problem is:

$$\begin{aligned}
\max_{\lambda}\ &\sum_{i=1}^{n}\lambda_i-\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j\\
\text{s.t.}\ \ &\lambda_i\ge 0,\ i=1,\dots,n\\
&\sum_{i=1}^{n}\lambda_i y_i=0
\end{aligned}$$

We can solve this dual problem using the gradient descent algorithm or Sequential Minimal Optimization (SMO); this will be discussed in the next post.
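
Just to make the dual concrete before those dedicated solvers, here is a minimal sketch that hands the dual QP to an off-the-shelf constrained solver (scipy's SLSQP). The toy dataset and the choice of solver are illustrative assumptions, not part of the derivation above.

```python
# A minimal sketch: solve the SVM dual for a tiny linearly separable dataset
# with a generic constrained solver (scipy SLSQP). Illustrative only.
import numpy as np
from scipy.optimize import minimize

# Toy 2D data: two well-separated clusters, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -3.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Q_ij = y_i y_j x_i^T x_j, the matrix appearing in the dual objective
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(lam):
    # negative of g(lambda) = sum_i lambda_i - 1/2 sum_ij lambda_i lambda_j y_i y_j x_i^T x_j
    return 0.5 * lam @ Q @ lam - lam.sum()

constraints = [{"type": "eq", "fun": lambda lam: lam @ y}]  # sum_i lambda_i y_i = 0
bounds = [(0.0, None)] * n                                  # lambda_i >= 0

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=bounds, constraints=constraints)
lam = res.x
print("dual optimum lambda:", np.round(lam, 4))
```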

Once we get the dual optimum $\lambda^*$, we can get the primal optimum $w^*=\sum_{i=1}^{n}\lambda_i^* y_i x_i$. But wait, how do we get the optimal $b$? To answer this, we need to analyze the KKT conditions for the SVM optimization problem.

KKT conditions for SVM

Since the primal constraints $1-y_i(w^T x_i+b)\le 0$ are affine in $(w,b)$ (and a feasible point exists whenever the data are linearly separable), Slater’s condition holds, so strong duality holds and the KKT conditions are satisfied at the primal and dual optima of the SVM. Therefore, we have complementary slackness:

$$\lambda_i^*\bigl(1-y_i(w^{*T} x_i+b^*)\bigr)=0,\quad i=1,\dots,n$$

This looks interesting. From the dual constraints we know that $\lambda^*\ge 0$. Together with complementary slackness, this tells us that if $\lambda_i^*>0$, then $y_i(w^{*T} x_i+b^*)=1$ must hold. This means $x_i$ is exactly one of the support vectors (the points lying exactly at the margin distance from the separating hyperplane)!

Therefore, we find a way to identify support vectors using Lagrangian duality:

  • Compute the dual optimum; if $\lambda_i^*>0$, then $x_i$ is a support vector.

Let $S=\{i\mid \lambda_i^*>0\}$ denote the support vector index set, $S_+=\{i\mid i\in S \text{ and } y_i=+1\}$ the subset whose labels are $+1$, and $S_-=\{i\mid i\in S \text{ and } y_i=-1\}$ the subset whose labels are $-1$. Then the primal optimum is:

$$w^*=\sum_{i\in S}\lambda_i^* y_i x_i$$

Since for every support vector $x_i$, $i\in S$, it holds that $y_i(w^{*T} x_i+b^*)=1$ and $y_i\in\{-1,+1\}$, we get $w^{*T} x_i+b^*=y_i$. Therefore, the primal optimum of $b$ is:

$$b^*=y_i-w^{*T} x_i,\quad i\in S$$

or

$$b^*=-\frac{1}{2}\bigl(w^{*T} x_i+w^{*T} x_j\bigr),\quad i\in S_+,\ j\in S_-$$

In practice, to reduce the influence of noise, we may use a more stable way to compute $b^*$:

$$b^*=\frac{1}{|S|}\sum_{i\in S}\bigl(y_i-w^{*T} x_i\bigr)$$
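
Continuing the sketch above (it reuses `X`, `y` and `lam` from the previous block), the support vectors, $w^*$ and the averaged $b^*$ can be read off the dual optimum like this; the tolerance used to decide $\lambda_i^*>0$ is an arbitrary choice.

```python
# Continuation of the previous sketch (assumes X, y, lam are still in scope).
tol = 1e-6
sv = lam > tol                            # support vectors: lambda_i > 0 (up to tolerance)
w = ((lam * y)[:, None] * X).sum(axis=0)  # w* = sum_i lambda_i* y_i x_i
b = np.mean(y[sv] - X[sv] @ w)            # b* averaged over the support vectors
print("support vector indices:", np.where(sv)[0])
print("w* =", np.round(w, 4), " b* =", round(float(b), 4))
```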

Use SVM for Classification

Given a new point $x$, we can compute the value $w^{*T} x+b^*$ and predict the label $\hat{y}$ using a hard or soft decision, as shown in An Introduction to Support Vector Machines (SVM): Gradient Descent Solution. Substituting the expression of $w^*$, we have:

$$w^{*T} x+b^*=\sum_{i\in S}\lambda_i^* y_i x_i^T x+b^*$$

This implies that we only need the support vectors to determine the separating hyperplane and to classify new points. Furthermore, notice that both in the dual problem and in the classification rule, the data points enter only through inner products $x_i^T x_j$ (or $x_i^T x$). This property is exactly what Kernel SVM exploits, which will be discussed in the following posts.
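
Continuing the same sketch, classifying a new point then only touches the support vectors; the test point below is arbitrary.

```python
# Continuation of the previous sketch (assumes X, y, lam, sv, b are still in scope).
def decision(x_new):
    # f(x) = sum_{i in S} lambda_i* y_i x_i^T x + b*
    return (lam[sv] * y[sv]) @ (X[sv] @ x_new) + b

x_new = np.array([1.0, 0.5])
print("f(x_new) =", round(float(decision(x_new)), 4),
      " predicted label:", int(np.sign(decision(x_new))))
```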

In the next post I will introduce how to solve the dual problem.