An Introduction to Support Vector Machines (SVM): Sequential Minimal Optimization (SMO)

An Overview of Support Vector Machines (SVM): Solving the Dual Problem with the SMO Algorithm

Posted by Gu on June 28, 2019

Recall the Kernel SVM dual problem:

Dual Problem

$$ \max_{\lambda, \mu} L(\lambda)= \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i \lambda_j y_i y_j K_{i,j} \\ \begin{align} s.t.\ & 0 \leq \lambda_i \leq C \\ &\sum_{i=1}^{n} \lambda_i y_i =0 \end{align} $$
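To make the objective and constraints concrete, here is a minimal sketch in NumPy (the names `lam`, `y`, `K` are my own illustrative choices, assuming `K` is a precomputed kernel matrix and `y` holds labels in \(\{-1,+1\}\)):

```python
import numpy as np

def dual_objective(lam, y, K):
    """L(lambda) = sum_i lam_i - 1/2 * sum_{i,j} lam_i lam_j y_i y_j K_ij."""
    v = lam * y                                  # element-wise lambda_i * y_i
    return lam.sum() - 0.5 * v @ (K @ v)

def is_dual_feasible(lam, y, C, tol=1e-8):
    """Box constraint 0 <= lam_i <= C and equality constraint sum_i lam_i y_i = 0."""
    return bool((lam >= -tol).all() and (lam <= C + tol).all() and abs(lam @ y) < tol)

# the all-zero vector is always feasible, which is exactly how SMO is initialized below
y_toy = np.array([1.0, -1.0, 1.0, -1.0])
assert is_dual_feasible(np.zeros(4), y_toy, C=1.0)
```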

We have already introduced a gradient-based method for solving the dual problem. However, computing the full gradient in every iteration has high time complexity and places a heavy burden on memory, especially when the training dataset is large. In this post, I introduce an efficient, lightweight algorithm for solving the dual problem: Sequential Minimal Optimization (SMO).

Sequential Minimal Optimization (SMO)

The algorithm of SMO is:

Initialization: let \(\{\lambda_i\}, i=1,\dots,n\) be a set of values that satisfies the dual constraints.
Repeat:
\(\ \ \ \\)(1) heuristically select two \(\lambda_a, \lambda_b\), and set all the other \(\lambda_i (i\neq a,b)\) fixed;
\(\ \ \ \\)(2) optimize \(L(\lambda)\) with respect to \(\lambda_a, \lambda_b\);
Until: the KKT conditions are satisfied to a certain accuracy.

The first question concerns the initialization: how do we find a set \(\{\lambda_i\}\) that satisfies the dual constraints?
The answer is to simply set \(\lambda_i=0\) for \(i=1,\dots,n\).

Suppose that we have finished the initialization and picked a pair \(\lambda_a, \lambda_b\) to optimize while keeping \(\lambda_i (i\neq a,b)\) fixed. Then we have

$$ \begin{align} L(\lambda) =& \lambda_a + \lambda_b -\frac{1}{2} \lambda_a^2 K_{a,a} - \frac{1}{2} \lambda_b^2 K_{b,b} - \lambda_a \lambda_b y_a y_b K_{a,b} \\ & - \sum_{i\neq a,b} \lambda_a \lambda_i y_a y_i K_{a,i} - \sum_{i \neq a,b} \lambda_b \lambda_i y_b y_i K_{b,i} + Const \end{align} $$

Moreover, according to the dual constraints, we have

$$ \lambda_a y_a + \lambda_b y_b = -\sum_{i\neq a,b} \lambda_i y_i = - \xi\\ \lambda_b y_b = -\lambda_a y_a -\xi\\ \lambda_b = -\lambda_a y_a y_b -\xi y_b $$

So we have

$$ \begin{align} L(\lambda) =& \lambda_a -\lambda_a y_a y_b - \xi y_b - \frac{1}{2}\lambda_a^2 K_{a,a} -\frac{1}{2}(\lambda_a y_a + \xi)^2 K_{b,b} \\ & + \lambda_a y_a ( \lambda_a y_a + \xi ) K_{a,b} - \sum_{i\neq a,b} \lambda_a y_a \lambda_i y_i K_{a,i}\\ & + \sum_{i\neq a,b}(\lambda_a y_a + \xi)\lambda_i y_i K_{b,i} + Const \end{align} $$

\(L(\lambda)\) is concave with respect to \(\lambda_a\), since \(\frac{\partial^2{L}}{\partial{\lambda_a^2}}= -( K_{a,a} + K_{b,b} - 2K_{a,b} )=-(e_a - e_b)^T \mathbf{K} (e_a - e_b) \leq 0\), which follows from the fact that the kernel matrix \(\mathbf{K}\) is positive semidefinite (see the last post, An Introduction to Support Vector Machines (SVM): kernel functions). Therefore, we can find the value of \(\lambda_a\) that maximizes \(L(\lambda)\) by computing the derivative and setting it to 0.

$$ \begin{align} \frac{\partial{L(\lambda)}}{\partial{\lambda_a}} =& 1 - y_a y_b -\lambda_a K_{a,a} - (\lambda_a y_a +\xi)y_a K_{b,b} + 2\lambda_a K_{a,b} \\ &+ y_a \xi K_{a,b} - \sum_{i\neq a,b} y_a \lambda_i y_i K_{a,i} + \sum_{i \neq a,b}y_a \lambda_i y_i K_{b,i}\\ =& 0 \end{align} $$

Solving this equation gives the new value of \(\lambda_a\):

$$ \lambda_a^{\text{new}} = \frac{ 1-y_a y_b - \xi y_a K_{b,b} + y_a \xi K_{a,b} - \sum_{i \neq a,b} y_a \lambda_i y_i K_{a,i} +\sum_{i\neq a,b}y_a \lambda_i y_i K_{b,i} }{ K_{a,a} + K_{b,b} -2K_{a,b} } $$

The numerator is cumbersome to compute directly because it contains many terms. Next, we show that \(\lambda_a^\text{new}, \lambda_b^\text{new}\) can actually be computed from the old values \(\lambda_a^\text{old}, \lambda_b^\text{old}\).

Before updating \(\lambda_a, \lambda_b\), we first use the old values of \(\lambda\) and \(b^\text{old}\) to compute the predictions on the data points \(\mathbf{x}_ a, \mathbf{x}_ b\):

$$ \begin{align} \hat{y}_a &= \sum_{i\neq a,b}\lambda_i y_i K_{i,a} + \lambda_a^\text{old} y_a K_{a,a} + \lambda_b^\text{old} y_b K_{b,a} + b^\text{old}\\ \hat{y}_b &= \sum_{i\neq a,b}\lambda_i y_i K_{i,b} + \lambda_a^\text{old} y_a K_{a,b} + \lambda_b^\text{old} y_b K_{b,b} + b^\text{old} \end{align} $$
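In code, \(\hat{y}_a\) and \(\hat{y}_b\) are simply the decision values under the current (old) multipliers and bias. A minimal sketch, assuming `lam` and `y` are NumPy arrays, `K` is the precomputed kernel matrix, and `bias` is \(b^\text{old}\) (all names illustrative):

```python
def decision_value(j, lam, y, K, bias):
    """f(x_j) = sum_i lam_i y_i K_ij + b, evaluated with the current multipliers."""
    return (lam * y) @ K[:, j] + bias

# prediction errors E_i = f(x_i) - y_i, used in the update below:
# E_a = decision_value(a, lam, y, K, bias) - y[a]
# E_b = decision_value(b, lam, y, K, bias) - y[b]
```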

Based on the expressions for \(\hat{y}_a, \hat{y}_b\), we have the following equation:

$$ \begin{align} &y_a[ (\hat{y}_b - y_b) - (\hat{y}_a - y_a) ]\\ = & \sum_{i\neq a,b}y_a \lambda_i y_i K_{i,b} + \lambda_a^\text{old} K_{a,b} + \lambda_b^\text{old} y_a y_b K_{b,b} + y_a b^\text{old} - y_a y_b \\ \ & - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} - \lambda_a^\text{old}K_{a,a} - \lambda_b^\text{old} y_a y_b K_{b,a} - y_a b^\text{old} +1\\ =& \sum_{i\neq a,b} y_a \lambda_i y_i K_{i,b} + \lambda_a^\text{old} K_{a,b} - \xi y_a K_{b,b} - \lambda_a^\text{old} K_{b,b}- y_a y_b \\ \ & - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} - \lambda_a^\text{old}K_{a,a} + \lambda_a^\text{old} K_{a,b} + \xi y_a K_{a,b} +1 \\ =& 1- y_a y_b - \xi y_a K_{b,b} + \xi y_a K_{a,b} - \sum_{i \neq a,b}y_a \lambda_i y_i K_{i,a} +\sum_{i\neq a,b} y_a \lambda_i y_i K_{i,b}\\ \ & -\lambda_a^\text{old}( K_{a,a} + K_{b,b} - 2K_{a,b} )\\ = & \lambda_a^{\text{new}}(K_{a,a} + K_{b,b} -2K_{a,b})-\lambda_a^\text{old}( K_{a,a} + K_{b,b} - 2K_{a,b} ) \end{align} $$

Denoting the prediction error by \(E_i= \hat{y}_i - y_i\), we obtain the expression for \(\lambda_a^\text{new}\):

$$ \lambda_a^\text{new} = \lambda_a^\text{old} + \frac{y_a(E_b - E_a)}{K_{a,a} +K_{b,b} - 2K_{a,b} } $$
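A sketch of this update step (same illustrative names as above; `a` and `b` are the indices of the selected pair, and the degenerate denominator case is discussed just below):

```python
def unclipped_lambda_a(a, b, lam, y, K, bias):
    """lambda_a^{new, unclipped} = lambda_a^old + y_a * (E_b - E_a) / eta."""
    E_a = (lam * y) @ K[:, a] + bias - y[a]     # prediction error on x_a
    E_b = (lam * y) @ K[:, b] + bias - y[b]     # prediction error on x_b
    eta = K[a, a] + K[b, b] - 2 * K[a, b]       # denominator; >= 0 for a PSD kernel
    return lam[a] + y[a] * (E_b - E_a) / eta    # assumes eta > 0 (see discussion below)
```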

Discussion: what if \(K_{a,a} +K_{b,b} - 2K_{a,b}=0\)? In this case \(L(\lambda)\) is a linear (first-degree) function of \(\lambda_a\), which is still concave, but the formula for \(\lambda_a^\text{new}\) is no longer well defined because the denominator vanishes. We then simply select another pair \((\lambda_a, \lambda_b)\) and repeat the computation above.
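As a quick numerical illustration of this point (purely a sanity check with an arbitrary RBF kernel; all names here are made up for the example), the denominator equals the quadratic form \((e_a - e_b)^T \mathbf{K} (e_a - e_b)\) and is therefore never negative:

```python
import numpy as np

# For a PSD kernel matrix, eta = K_aa + K_bb - 2*K_ab >= 0, so the only
# degenerate case is eta == 0 (e.g., two identical training points).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # toy data: 20 points in 3-D
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dist)                           # Gaussian (RBF) kernel matrix

a, b = 3, 7
eta = K[a, a] + K[b, b] - 2 * K[a, b]
e = np.zeros(20); e[a], e[b] = 1.0, -1.0
assert np.isclose(eta, e @ K @ e) and eta >= 0       # matches the quadratic form
```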

Note that this expression for \(\lambda_a^\text{new}\) is not yet clipped, so for clarity we call it \(\lambda_a^\text{new, unclipped}\). Computing \(\lambda_a^\text{new, unclipped}\) alone is not enough: we must further clip it to the feasible range determined by the dual constraints. According to the dual constraints, each \(\lambda_i\) satisfies a box constraint, so we have:

$$ 0\leq \lambda_a \leq C\\ 0\leq \lambda_b \leq C\\ \lambda_b = -\lambda_a y_a y_b - \xi y_b $$

We know that \(y_i \in \{-1, +1\}\). Based on whether \(y_a = y_b\) or not, the relationship between \(\lambda_a\) and \(\lambda_b\) under the box constraints is shown in the figure below.

Figure: Relationship between \(\lambda_a\) and \(\lambda_b\) under the box constraints.

According to the figure, we can get the lower bound \(L\) and upper bound \(H\) for a feasible new value of \(\lambda_a\):

  1. if \(y_a \neq y_b\):
    $$ L = \max(\xi y_b, 0)\\ H = \min(C+\xi y_b, C ) $$
  2. if \(y_a = y_b\):
    $$ L = \max(0, -C-\xi y_b)\\ H = \min(C, -\xi y_b) $$
    Based on \(L\) and \(H\), we can get the clipped new \(\lambda_a\):
$$ \lambda_a^\text{new, clipped} = \begin{cases} L, &\ \text{if}\ \lambda_a^\text{new, unclipped} < L \\ H, &\ \text{if}\ \lambda_a^\text{new, unclipped} > H \\ \lambda_a^\text{new, unclipped}, &\ \text{otherwise} \end{cases} $$

This \(\lambda_a^\text{new, clipped}\) is the final, feasible new value of \(\lambda_a\). For simplicity, in the following we use \(\lambda_a^\text{new}\) to refer to \(\lambda_a^\text{new, clipped}\).
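A sketch of the bounds and the clipping step (plain Python, same illustrative names; \(\xi\) is computed from the old multipliers, as discussed later in this post):

```python
def box_bounds(lam_a_old, lam_b_old, y_a, y_b, C):
    """Lower/upper bounds [L, H] for the new lambda_a under the box constraints."""
    xi = -lam_a_old * y_a - lam_b_old * y_b      # xi = -(lambda_a y_a + lambda_b y_b)
    if y_a != y_b:
        return max(xi * y_b, 0.0), min(C + xi * y_b, C)
    return max(0.0, -C - xi * y_b), min(C, -xi * y_b)

def clip(value, lo, hi):
    """Clip the unclipped update into [lo, hi]."""
    return max(lo, min(hi, value))
```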

After getting \(\lambda_a^\text{new}\), we need to compute \(\lambda_b^\text{new}\):

$$ \lambda_b^\text{new} = -\lambda_a^\text{new} y_a y_b - \xi y_b $$

Now we need to decide whether to update the value of the bias \(b\). If \(0<\lambda_a^\text{new}<C\), then \(\mathbf{x}_ a\) is a support vector that lies exactly on the margin. Therefore, we can update \(b^\text{new}\) as:

$$ \begin{align} b^\text{new} &= y_a -\sum_{i\neq a,b} \lambda_i y_i K_{i,a} - \lambda_a^\text{new} y_a K_{a,a} - \lambda_b^\text{new} y_b K_{b,a}\\ &= b^\text{old} - ( \sum_{i}\lambda_i y_i K_{i,a} + b^\text{old} - y_a ) \\ &\ \ \ + (\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a} +(\lambda_b^\text{old}-\lambda_b^\text{new}) y_b K_{b,a} \\ &= b^\text{old} - E_a + (\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a} +(\lambda_b^\text{old}-\lambda_b^\text{new}) y_b K_{b,a} \end{align} $$

Otherwise, if \(0<\lambda_b^\text{new}<C\), we can update \(b^\text{new}\) as:

$$ b^\text{new} = b^\text{old} - E_b + ( \lambda_a^\text{old} - \lambda_a^\text{new} )y_a K_{a,b} +( \lambda_b^\text{old} - \lambda_b^\text{new} ) y_b K_{b,b} $$

Note that if neither \(0<\lambda_a^\text{new}<C\) nor \(0<\lambda_b^\text{new}<C\) holds, we choose not to update \(b\) here.
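The bias update can then be written as one small helper (a sketch under the same illustrative names, following the three cases above):

```python
def update_bias(bias, a, b, lam_a_old, lam_a_new, lam_b_old, lam_b_new,
                E_a, E_b, y, K, C):
    """Update b using whichever new multiplier lies strictly inside (0, C)."""
    da = lam_a_old - lam_a_new
    db = lam_b_old - lam_b_new
    if 0 < lam_a_new < C:
        return bias - E_a + da * y[a] * K[a, a] + db * y[b] * K[b, a]
    if 0 < lam_b_new < C:
        return bias - E_b + da * y[a] * K[a, b] + db * y[b] * K[b, b]
    return bias                                  # neither condition holds: keep b
```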

Now, we have finished one single iteration in SMO.

Before we summarize the SMO algorithm, here is a refinement that improves the computational efficiency.

  1. Computation of \(\xi\): in the derivation above, \(\xi\) is used to compute \(L,\ H\) and \(\lambda_b^\text{new}\). Computing it as \(\xi = \sum_{i\neq a,b}\lambda_i y_i\) would be time consuming. Instead, we can use the equation
    $$ \xi = -\lambda_a^\text{old} y_a - \lambda_b^\text{old} y_b $$
    By substituting the expression of \(\xi\) into the expression of \(\lambda_b^\text{new}\), we have:
$$ \lambda_b^\text{new} = \lambda_b^\text{old} + ( \lambda_a^\text{old} - \lambda_a^\text{new}) y_a y_b $$
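So, once the clipped \(\lambda_a^\text{new}\) is known, \(\lambda_b^\text{new}\) follows in one line (illustrative sketch):

```python
def new_lambda_b(lam_a_old, lam_a_new, lam_b_old, y_a, y_b):
    """lambda_b^new = lambda_b^old + (lambda_a^old - lambda_a^new) * y_a * y_b."""
    return lam_b_old + (lam_a_old - lam_a_new) * y_a * y_b
```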

Sequential Minimal Optimization Algorithm

Based on the derivation above, the pseudocode of SMO is as follows.

Initialization: \(\lambda_i=0\) for \(i=1,\dots,n\), \(b=0\), and pre-compute the kernel matrix \(\mathbf{K}\)
Repeat:
\(\ \ \ \ \ \ \\)heuristically (or randomly) select a pair \(\lambda_a, \lambda_b\), and save \(\lambda_a^\text{old}\leftarrow \lambda_a,\ \lambda_b^\text{old}\leftarrow \lambda_b\);

\(\ \ \ \ \ \ \\)if \(K_{a,a}+K_{b,b}-2K_{a,b}==0\):
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)continue

\(\ \ \ \ \ \ \\)\(E_a = \sum_{i} \lambda_i y_i K_{i,a}+ b^\text{old} - y_a\)
\(\ \ \ \ \ \ \\)\(E_b = \sum_{i}\lambda_i y_i K_{i,b}+ b^\text{old} - y_b\)
\(\ \ \ \ \ \ \\)\(\lambda_a^\text{new, unclipped} = \lambda_a^\text{old} + \frac{ y_a (E_b - E_a)}{ K_{a,a} + K_{b,b} -2K_{a,b} }\)
\(\ \ \ \ \ \ \\)\(\xi = -\lambda_a^\text{old} y_a - \lambda_b^\text{old} y_b\)

\(\ \ \ \ \ \ \\)if \(y_a \neq y_b\):
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)\(L= \max( \xi y_b,0 ),\ H=\min(C+\xi y_b,C)\)
\(\ \ \ \ \ \ \\)else:
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)\(L= \max( 0, -C-\xi y_b ),\ H=\min(C, -\xi y_b)\)

\(\ \ \ \ \ \ \\)if \(\lambda_a^\text{new, unclipped} < L\):
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)\(\lambda_a^\text{new} = L\)
\(\ \ \ \ \ \ \\)else if \(\lambda_a^\text{new, unclipped} > H\):
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)\(\lambda_a^\text{new} = H\)
\(\ \ \ \ \ \ \\)else:
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)\(\lambda_a^\text{new} = \lambda_a^\text{new, unclipped}\)

\(\ \ \ \ \ \ \\)\(\lambda_b^\text{new}=\lambda_b^\text{old}+(\lambda_a^\text{old}-\lambda_a^\text{new})y_a y_b\)
\(\ \ \ \ \ \ \\)\(\lambda_a\leftarrow \lambda_a^\text{new},\ \lambda_b\leftarrow \lambda_b^\text{new}\)

\(\ \ \ \ \ \ \\)if \(0<\lambda_a^\text{new}<C\):
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)\(b^\text{new}=b^\text{old}-E_a +(\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,a}+(\lambda_b^\text{old}-\lambda_b^\text{new})y_b K_{b,a}\)
\(\ \ \ \ \ \ \\)else if \(0<\lambda_b^\text{new}<C\):
\(\ \ \ \ \ \ \\)\(\ \ \ \ \ \ \\)\(b^\text{new}=b^\text{old}-E_b +(\lambda_a^\text{old}-\lambda_a^\text{new})y_a K_{a,b}+(\lambda_b^\text{old}-\lambda_b^\text{new})y_b K_{b,b}\)

Until: the maximum number of iterations is reached, or the dual objective \(L(\lambda)\) can no longer be increased beyond a given tolerance.
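To make the pseudocode concrete, below is a minimal, self-contained sketch of the whole procedure in Python/NumPy. It uses random pair selection instead of a heuristic, a linear kernel by default, and a fixed iteration budget; all function and variable names are my own choices for illustration, not part of any library.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of the linear kernel; any PSD kernel matrix can be used instead."""
    return X @ X.T

def smo_train(X, y, C=1.0, max_iter=200, tol=1e-5, seed=0):
    """Minimal SMO for the kernel SVM dual (random pair selection for simplicity).

    Returns (lam, bias) for the decision function f(x_j) = sum_i lam_i y_i K_ij + bias.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    K = linear_kernel(X)                        # precompute the kernel matrix
    lam = np.zeros(n)                           # feasible start: all multipliers zero
    bias = 0.0

    for _ in range(max_iter):
        changed = 0
        for a in range(n):
            b = int(rng.integers(n - 1))        # random partner index, b != a
            if b >= a:
                b += 1

            eta = K[a, a] + K[b, b] - 2 * K[a, b]
            if eta < 1e-12:
                continue                        # degenerate pair: skip it

            E_a = (lam * y) @ K[:, a] + bias - y[a]
            E_b = (lam * y) @ K[:, b] + bias - y[b]
            lam_a_old, lam_b_old = lam[a], lam[b]
            xi = -lam_a_old * y[a] - lam_b_old * y[b]

            if y[a] != y[b]:                    # bounds [L, H] from the box constraints
                L, H = max(xi * y[b], 0.0), min(C + xi * y[b], C)
            else:
                L, H = max(0.0, -C - xi * y[b]), min(C, -xi * y[b])
            if H - L < tol:
                continue

            lam_a_new = float(np.clip(lam_a_old + y[a] * (E_b - E_a) / eta, L, H))
            if abs(lam_a_new - lam_a_old) < tol:
                continue                        # no meaningful progress on this pair
            lam_b_new = lam_b_old + (lam_a_old - lam_a_new) * y[a] * y[b]

            if 0 < lam_a_new < C:               # update the bias when possible
                bias += -E_a + (lam_a_old - lam_a_new) * y[a] * K[a, a] \
                             + (lam_b_old - lam_b_new) * y[b] * K[b, a]
            elif 0 < lam_b_new < C:
                bias += -E_b + (lam_a_old - lam_a_new) * y[a] * K[a, b] \
                             + (lam_b_old - lam_b_new) * y[b] * K[b, b]

            lam[a], lam[b] = lam_a_new, lam_b_new
            changed += 1
        if changed == 0:                        # no pair was updated in a full pass
            break
    return lam, bias
```

Here `y` is expected to be a NumPy array of labels in \(\{-1, +1\}\); after training, the prediction on training point \(j\) is `np.sign((lam * y) @ K[:, j] + bias)`.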

Cool, isn’t it? Now we are able to solve the dual problem using the SMO algorithm!
