An Introduction to Support Vector Machines (SVM): Convex Optimization and Lagrangian Duality Principle


In the last post we covered how to use the gradient descent (GD) algorithm to train an SVM. Although GD can solve the SVM optimization problem, it has some shortcomings:

  • The gradient descent procedure is time-consuming, and the solution may be suboptimal.
  • GD cannot explicitly identify the support vectors, i.e. the points that determine the separating hyperplane.

To overcome these shortcomings, we can take advantage of Lagrangian duality. First we write the original SVM optimization problem as a primal (convex) optimization problem, and from it we derive the Lagrangian dual problem. Luckily, we can solve the dual problem based on the KKT conditions using more efficient methods.

First of all, we need to briefly introduce Lagrangian duality and the Karush-Kuhn-Tucker (KKT) conditions.

Lagrangian Duality Principle

Primal Problem
A primal convex optimization problem has the following expression:

$$\min_x f_0(x)\quad \text{s.t.}\quad f_i(x)\le 0,\ i=1,\dots,n; \qquad h_j(x)=0,\ j=1,\dots,p$$

where $f_i(x)\ (i=0,1,\dots,n)$ are convex, and $h_j(x)\ (j=1,\dots,p)$ are linear (or affine).

  • The constraint that $f_i(x)\ (i=0,1,\dots,n)$ are convex defines a convex region.
  • The constraint that $h_j(x)\ (j=1,\dots,p)$ are linear confines the region to the intersection of multiple hyperplanes (potentially reducing the dimensionality).

We can get the Lagrangian function:

$$L(x,\lambda,\mu)=f_0(x)+\sum_{i=1}^{n}\lambda_i f_i(x)+\sum_{j=1}^{p}\mu_j h_j(x)$$

Since $f_i(x)$ are convex and $h_j(x)$ are linear, $L(x,\lambda,\mu)$ is also convex w.r.t. $x$ (for $\lambda_i\ge 0$). Therefore, we can take the infimum of $L(x,\lambda,\mu)$ over $x$, which is called the Lagrangian dual function:

$$g(\lambda,\mu)=\inf_x\, L(x,\lambda,\mu)$$

The difference between minimum and infimum:

  • $\min(S)$ means the smallest element in the set $S$;
  • $\inf(S)$ means the greatest lower bound of $S$, i.e. the largest value that is less than or equal to every element of $S$;
  • when the minimum is attained, the infimum equals the minimum, e.g. $S=\{\text{all natural numbers}\}$ gives $\inf(S)=\min(S)=0$;
  • when the minimum is not attained, the infimum may still exist, e.g. $S=\{f(x)\mid f(x)=1/x,\ x>0\}$ gives $\inf(S)=0$.

Dual Problem
Based on the dual function we can get the dual optimization problem:

$$\max_{\lambda,\mu}\ g(\lambda,\mu)\quad \text{s.t.}\quad \lambda_i\ge 0,\ i=1,\dots,n,$$

plus any other constraints introduced by computing the dual function.
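
As a tiny worked example (a problem of my own choosing, not from the earlier posts), take the one-dimensional problem $\min_x x^2$ subject to $1-x\le 0$. The whole pipeline looks like this:

$$L(x,\lambda)=x^2+\lambda(1-x),\qquad g(\lambda)=\inf_x L(x,\lambda)=\lambda-\frac{\lambda^2}{4},\qquad \max_{\lambda\ge 0}\ g(\lambda)\ \text{is attained at}\ \lambda^*=2.$$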

Strong Duality and Slater’s Condition
Let $f_0(x^*)$ and $g(\lambda^*,\mu^*)$ be the primal optimum and the dual optimum, respectively. Weak duality means that $g(\lambda^*,\mu^*)\le f_0(x^*)$. The difference $f_0(x^*)-g(\lambda^*,\mu^*)$ is called the duality gap.

Under certain circumstances the duality gap is 0, which means strong duality holds. A sufficient condition for strong duality is Slater’s condition:

  • Apart from the constraints in the primal problem being satisfiable, Slater’s condition requires that there exists an $x$ at which all inequality constraints hold strictly, i.e. $f_i(x)<0$ for $i=1,\dots,n$. When the constraints $f_i(x)$ are linear (or affine), this requirement relaxes: a feasible $x$ is enough, and strict inequality is not needed.

If Slater’s condition is satisfied, strong duality holds; furthermore, at the primal optimum $x^*$ and the dual optimum $\lambda^*, \mu^*$, the Karush-Kuhn-Tucker (KKT) conditions also hold.

Karush-Kuhn-Tucker (KKT) Conditions
There are four KKT conditions:

  1. Primal constraints
    $f_i(x^*)\le 0,\ i=1,\dots,n$ and $h_j(x^*)=0,\ j=1,\dots,p$
  2. Dual constraints
    $\lambda_i^*\ge 0,\ i=1,\dots,n$
  3. Stationarity (compute the infimum of $L$ w.r.t. $x$)
    $\nabla_x L(x^*,\lambda^*,\mu^*)=0$
  4. Complementary slackness
    $\lambda_i^* f_i(x^*)=0,\ i=1,\dots,n$

Therefore, if strong duality holds, we can first solve the dual problem to get the optimal $\lambda^*$ and $\mu^*$. Then we substitute this dual optimum into the KKT conditions (especially the stationarity condition) to recover the primal optimum $x^*$. The primal convex optimization problem is then solved.
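
To make this recipe concrete, return to the toy example above: its dual optimum is $\lambda^*=2$, and the stationarity condition recovers the primal optimum directly:

$$\nabla_x L(x,\lambda^*)=2x-\lambda^*=0\ \Rightarrow\ x^*=\frac{\lambda^*}{2}=1,$$

and indeed $\lambda^*\,(1-x^*)=0$ (complementary slackness) with $f_0(x^*)=g(\lambda^*)=1$, so the duality gap is zero.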

Apply Lagrangian Duality to SVM

Now we are able to solve the SVM optimization problem using Lagrangian duality. As introduced in the first post An Introduction to Support Vector Machines (SVM): Basics, the SVM optimization problem is:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2\quad \text{s.t.}\quad y_i(w^T x_i+b)\ge 1,\ i=1,\dots,n$$

The Lagrangian function is

$$L(w,b,\lambda)=\frac{1}{2}\|w\|^2+\sum_{i=1}^{n}\lambda_i\bigl(1-y_i(w^T x_i+b)\bigr)$$

To compute the Lagrangian dual function, we compute the partial derivatives of $L$ w.r.t. $w$ and $b$ and set them to 0 (see the stationarity condition):

$$\frac{\partial L}{\partial w}=w-\sum_{i=1}^{n}\lambda_i y_i x_i=0,\qquad \frac{\partial L}{\partial b}=-\sum_{i=1}^{n}\lambda_i y_i=0$$

Then we get

$$w=\sum_{i=1}^{n}\lambda_i y_i x_i,\qquad \sum_{i=1}^{n}\lambda_i y_i=0$$

Substituting these two equations into $L(w,b,\lambda)$, we get the Lagrangian dual function:

$$\begin{aligned}
g(\lambda)&=\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j+\sum_{i=1}^{n}\lambda_i\Bigl(1-y_i\bigl(\sum_{j=1}^{n}\lambda_j y_j x_j^T x_i+b\bigr)\Bigr)\\
&=\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j-\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j+\sum_{i=1}^{n}\lambda_i-\Bigl(\sum_{i=1}^{n}\lambda_i y_i\Bigr)b\\
&=\sum_{i=1}^{n}\lambda_i-\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j
\end{aligned}$$

Then the dual problem is:

$$\begin{aligned}
\max_{\lambda}\ &\sum_{i=1}^{n}\lambda_i-\frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j\\
\text{s.t.}\ \ &\lambda_i\ge 0,\ i=1,\dots,n\\
&\sum_{i=1}^{n}\lambda_i y_i=0
\end{aligned}$$

We can solve this dual problem using the gradient descent algorithm or Sequential Minimal Optimization (SMO); this will be discussed in the next post.
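
Just to make the dual concrete before those dedicated solvers, here is a minimal sketch that hands the dual QP to an off-the-shelf constrained solver (scipy's SLSQP). The toy dataset and the choice of solver are illustrative assumptions, not part of the derivation above.

```python
# A minimal sketch: solve the SVM dual for a tiny linearly separable dataset
# with a generic constrained solver (scipy SLSQP). Illustrative only.
import numpy as np
from scipy.optimize import minimize

# Toy 2D data: two well-separated clusters, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -3.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Q_ij = y_i y_j x_i^T x_j, the matrix appearing in the dual objective
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(lam):
    # negative of g(lambda) = sum_i lambda_i - 1/2 sum_ij lambda_i lambda_j y_i y_j x_i^T x_j
    return 0.5 * lam @ Q @ lam - lam.sum()

constraints = [{"type": "eq", "fun": lambda lam: lam @ y}]  # sum_i lambda_i y_i = 0
bounds = [(0.0, None)] * n                                  # lambda_i >= 0

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=bounds, constraints=constraints)
lam = res.x
print("dual optimum lambda:", np.round(lam, 4))
```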

Once we get the dual optimum $\lambda^*$, we can get the primal optimum $w^*=\sum_{i=1}^{n}\lambda_i^* y_i x_i$. But wait, how do we get the optimal $b$? To answer this, we need to analyze the KKT conditions for the SVM optimization problem.

KKT conditions for SVM

Since the primal constraints $1-y_i(w^T x_i+b)\le 0$ are affine in $(w,b)$ (and a feasible point exists whenever the data are linearly separable), Slater’s condition holds, so strong duality holds and the KKT conditions are satisfied at the primal and dual optima of the SVM. Therefore, we have complementary slackness:

$$\lambda_i^*\bigl(1-y_i(w^{*T} x_i+b^*)\bigr)=0,\quad i=1,\dots,n$$

This looks interesting. From the dual constraints we know that $\lambda^*\ge 0$. Together with complementary slackness, this tells us that if $\lambda_i^*>0$, then $y_i(w^{*T} x_i+b^*)=1$ must hold. This means $x_i$ is exactly one of the support vectors (the points lying exactly at the margin distance from the separating hyperplane)!

Therefore, we find a way to identify support vectors using Lagrangian duality:

  • Compute the dual optimum; if $\lambda_i^*>0$, then $x_i$ is a support vector.

Let $S=\{i\mid \lambda_i^*>0\}$ denote the support vector index set, $S_+=\{i\mid i\in S \text{ and } y_i=+1\}$ the subset whose labels are $+1$, and $S_-=\{i\mid i\in S \text{ and } y_i=-1\}$ the subset whose labels are $-1$. Then the primal optimum is:

$$w^*=\sum_{i\in S}\lambda_i^* y_i x_i$$

Since for every support vector $x_i$, $i\in S$, it holds that $y_i(w^{*T} x_i+b^*)=1$ and $y_i\in\{-1,+1\}$, we get $w^{*T} x_i+b^*=y_i$. Therefore, the primal optimum of $b$ is:

$$b^*=y_i-w^{*T} x_i,\quad i\in S$$

or

$$b^*=-\frac{1}{2}\bigl(w^{*T} x_i+w^{*T} x_j\bigr),\quad i\in S_+,\ j\in S_-$$

In practice, to reduce the influence of noise, we may use a more stable way to compute $b^*$:

$$b^*=\frac{1}{|S|}\sum_{i\in S}\bigl(y_i-w^{*T} x_i\bigr)$$
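
Continuing the sketch above (it reuses `X`, `y` and `lam` from the previous block), the support vectors, $w^*$ and the averaged $b^*$ can be read off the dual optimum like this; the tolerance used to decide $\lambda_i^*>0$ is an arbitrary choice.

```python
# Continuation of the previous sketch (assumes X, y, lam are still in scope).
tol = 1e-6
sv = lam > tol                            # support vectors: lambda_i > 0 (up to tolerance)
w = ((lam * y)[:, None] * X).sum(axis=0)  # w* = sum_i lambda_i* y_i x_i
b = np.mean(y[sv] - X[sv] @ w)            # b* averaged over the support vectors
print("support vector indices:", np.where(sv)[0])
print("w* =", np.round(w, 4), " b* =", round(float(b), 4))
```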

Use SVM for Classification

Given a new point $x$, we can compute the value $w^{*T} x+b^*$ and predict the label $\hat{y}$ using a hard or soft decision, as shown in An Introduction to Support Vector Machines (SVM): Gradient Descent Solution. Substituting the expression of $w^*$, we have:

$$w^{*T} x+b^*=\sum_{i\in S}\lambda_i^* y_i x_i^T x+b^*$$

This implies that we only need the support vectors to determine the separating hyperplane and to classify new points. Furthermore, notice that both in the dual problem and in the classification rule, the data points enter only through inner products $x_i^T x_j$ (or $x_i^T x$). This property is exactly what Kernel SVM exploits, which will be discussed in the following posts.
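
Continuing the same sketch, classifying a new point then only touches the support vectors; the test point below is arbitrary.

```python
# Continuation of the previous sketch (assumes X, y, lam, sv, b are still in scope).
def decision(x_new):
    # f(x) = sum_{i in S} lambda_i* y_i x_i^T x + b*
    return (lam[sv] * y[sv]) @ (X[sv] @ x_new) + b

x_new = np.array([1.0, 0.5])
print("f(x_new) =", round(float(decision(x_new)), 4),
      " predicted label:", int(np.sign(decision(x_new))))
```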

In the next post I will introduce how to solve the dual problem.