Construction of linearly separable SVM
This note walks through the derivation of the two-class classification SVM.
Suppose we have a set of data \(\mathbb{D} = \{(\boldsymbol{x}_i, y_i)\}_{i = 1}^m\), where \(\boldsymbol{x}_i \in \mathbb{R}^n\) is the feature vector and \(y_i \in \{-1, +1\}\) is the label. The classifier is given by the hypothesis \(h_{\boldsymbol{\omega}, b}\):
\[h_{\boldsymbol{\omega}, b}(\boldsymbol{x}) = g(\boldsymbol{\omega}^T \boldsymbol{x} + b),\]
where \(g(z) = 1\) if \(z \ge 0\), and \(g(z) = -1\) if \(z \lt 0\).
Therefore, we can view \(\boldsymbol{\omega}^T \boldsymbol{x} + b = 0\) as a hyperplane in \(\mathbb{R}^n\) that separates the elements of the two classes into different half-spaces.
Based on the above notation, we define the functional margin w.r.t. each element in \(\mathbb{D}\) as
\[\hat{\gamma}^{(i)} = y^{(i)} (\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b).\]
Then, if \(y^{(i)} = 1\), a larger value of \(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b\) gives a larger \(\hat{\gamma}^{(i)}\); similarly for the case where \(y^{(i)} = -1\).
Note that with this definition of the hypothesis \(h_{\boldsymbol{\omega}, b}\), scaling \(\boldsymbol{\omega}\) and \(b\) does not change the output of the hypothesis; it only changes the value of \(\hat{\gamma}^{(i)}\).
We then define the functional margin w.r.t. the whole \(\mathbb{D}\) as
\[\hat{\gamma} = \min_{i = 1, \dots, m} \hat{\gamma}^{(i)}.\]
The geometric margin \(\gamma^{(i)}\) w.r.t. each element in \(\mathbb{D}\) is defined as the distance of that element from the hyperplane \(\boldsymbol{\omega}^T \boldsymbol{x} + b = 0\). For a positive element, the projection of \(\boldsymbol{x}^{(i)}\) onto the hyperplane is \(\boldsymbol{x}^{(i)} - \gamma^{(i)}\frac{\boldsymbol{\omega}}{||\boldsymbol{\omega}||}\), and since this point lies on the hyperplane, the following equation holds:
\[\boldsymbol{\omega}^T \left( \boldsymbol{x}^{(i)} - \gamma^{(i)} \frac{\boldsymbol{\omega}}{||\boldsymbol{\omega}||} \right) + b = 0.\]
Solving for \(\gamma^{(i)}\), we get
\[\gamma^{(i)} = \frac{\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b}{||\boldsymbol{\omega}||} = \left( \frac{\boldsymbol{\omega}}{||\boldsymbol{\omega}||} \right)^T \boldsymbol{x}^{(i)} + \frac{b}{||\boldsymbol{\omega}||}.\]
Taking the case where \(y^{(i)} = -1\) into account as well, we get
\[\gamma^{(i)} = y^{(i)} \left( \left( \frac{\boldsymbol{\omega}}{||\boldsymbol{\omega}||} \right)^T \boldsymbol{x}^{(i)} + \frac{b}{||\boldsymbol{\omega}||} \right).\]
If \(||\boldsymbol{\omega}|| = 1\), then the functional margin equals the geometric margin.
And the geometric margin w.r.t. the whole \(\mathbb{D}\) is
\[\gamma = \min_{i = 1, \dots, m} \gamma^{(i)}.\]
With the definition of the geometric margin, the optimal classification boundary is the hyperplane that attains the maximum geometric margin w.r.t. the whole \(\mathbb{D}\):
\[\begin{aligned} \max_{\gamma, \boldsymbol{\omega}, b} \quad & \gamma \\ \text{s.t.} \quad & y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \ge \gamma, \quad i = 1, \dots, m \\ & ||\boldsymbol{\omega}|| = 1, \end{aligned}\]
which can be transformed to
\[\begin{aligned} \max_{\hat{\gamma}, \boldsymbol{\omega}, b} \quad & \frac{\hat{\gamma}}{||\boldsymbol{\omega}||} \\ \text{s.t.} \quad & y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \ge \hat{\gamma}, \quad i = 1, \dots, m, \end{aligned}\]
where the objective is rewritten in terms of the functional margin using \(\gamma = \frac{\hat{\gamma}}{||\boldsymbol{\omega}||}\), and the awkward constraint \(||\boldsymbol{\omega}|| = 1\) is dropped. It can be further transformed into
\[\begin{aligned} \min_{\boldsymbol{\omega}, b} \quad & \frac{1}{2} ||\boldsymbol{\omega}||^2 \\ \text{s.t.} \quad & y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \ge 1, \quad i = 1, \dots, m. \end{aligned}\]
This last step works because scaling \(\boldsymbol{\omega}\) and \(b\) lets \(\hat{\gamma}\) take any value without changing the classifier, so we may fix \(\hat{\gamma} = 1\); maximizing \(\frac{1}{||\boldsymbol{\omega}||}\) is then the same as minimizing \(\frac{1}{2}||\boldsymbol{\omega}||^2\). We thus arrive at a convex quadratic optimization problem with linear constraints.
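As a quick numerical sanity check of this formulation, here is a minimal sketch that fits a (nearly) hard-margin linear SVM on a toy separable dataset. It assumes scikit-learn is available; the toy data and the very large value of \(C\) (used to approximate the hard-margin problem) are illustrative choices of mine.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two point clouds with labels in {-1, +1}.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin problem
#   min 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margins = y * (X @ w + b)
print("w =", w, "b =", b)
print("min functional margin:", margins.min())                 # close to 1
print("geometric margin:", margins.min() / np.linalg.norm(w))
print("support vector indices:", clf.support_)
```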
Lagrange duality
Consider a problem of the following form:
\[\begin{aligned} \min_{\boldsymbol{\omega}} \quad & f(\boldsymbol{\omega}) \\ \text{s.t.} \quad & h_i(\boldsymbol{\omega}) = 0, \quad i = 1, \dots, l. \end{aligned}\]
The Lagrangian is defined to be
\[\mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\beta}) = f(\boldsymbol{\omega}) + \sum_{i = 1}^l \beta_i h_i(\boldsymbol{\omega}),\]
where the \(\beta_i\)'s are called the Lagrange multipliers. We can then solve for \(\boldsymbol{\omega}\) and \(\boldsymbol{\beta}\) by setting
\[\frac{\partial \mathcal{L}}{\partial \omega_i} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0.\]
The primal optimization problem is defined as
\[\begin{aligned} \min_{\boldsymbol{\omega}} \quad & f(\boldsymbol{\omega}) \\ \text{s.t.} \quad & g_i(\boldsymbol{\omega}) \le 0, \quad i = 1, \dots, k \\ & h_i(\boldsymbol{\omega}) = 0, \quad i = 1, \dots, l, \end{aligned}\]
and the generalized Lagrangian is defined as
\[\mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\boldsymbol{\omega}) + \sum_{i = 1}^k \alpha_i g_i(\boldsymbol{\omega}) + \sum_{i = 1}^l \beta_i h_i(\boldsymbol{\omega}),\]
where the \(\alpha_i\)'s and \(\beta_i\)'s are the Lagrange multipliers. Consider the quantity
\[\theta_\mathcal{P}(\boldsymbol{\omega}) = \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_i \ge 0} \mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}),\]
where \(\mathcal{P}\) stands for "primal". We can see that
\[\theta_\mathcal{P}(\boldsymbol{\omega}) = \begin{cases} f(\boldsymbol{\omega}) & \text{if } \boldsymbol{\omega} \text{ satisfies the primal constraints} \\ \infty & \text{otherwise.} \end{cases}\]
The primal problem can then be expressed as
\[\min_{\boldsymbol{\omega}} \theta_\mathcal{P}(\boldsymbol{\omega}) = \min_{\boldsymbol{\omega}} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_i \ge 0} \mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}),\]
whose optimal value we denote \(p^*\). The dual problem is
\[\max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_i \ge 0} \theta_\mathcal{D}(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_i \ge 0} \min_{\boldsymbol{\omega}} \mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}),\]
where \(\theta_\mathcal{D}(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \min_{\boldsymbol{\omega}} \mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta})\), \(\mathcal{D}\) stands for "dual", and the optimal value is denoted \(d^*\). The optimal value of the dual problem relates to the optimal value of the primal problem as
\[d^* = \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_i \ge 0} \min_{\boldsymbol{\omega}} \mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \le \min_{\boldsymbol{\omega}} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} : \alpha_i \ge 0} \mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = p^*.\]
Suppose \(f\) and the \(g_i\)'s are convex, and the \(h_i\)'s are affine. Suppose also that the constraints \(g_i\) are (strictly) feasible: \(\exists\, \boldsymbol{\omega}\) such that \(g_i(\boldsymbol{\omega}) \lt 0\) for all \(i\). Then there exist \(\boldsymbol{\omega}^*\), \(\boldsymbol{\alpha}^*\), \(\boldsymbol{\beta}^*\) such that \(\boldsymbol{\omega}^*\) is the solution to the primal problem, \(\boldsymbol{\alpha}^*\) and \(\boldsymbol{\beta}^*\) are the solution to the dual problem, and \(p^* = d^* = \mathcal{L}(\boldsymbol{\omega}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)\). Moreover, \(\boldsymbol{\omega}^*\), \(\boldsymbol{\alpha}^*\) and \(\boldsymbol{\beta}^*\) satisfy the Karush-Kuhn-Tucker (KKT) conditions, which are as follows:
\[\begin{aligned} \frac{\partial}{\partial \omega_i} \mathcal{L}(\boldsymbol{\omega}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) &= 0, \quad i = 1, \dots, n \\ \frac{\partial}{\partial \beta_i} \mathcal{L}(\boldsymbol{\omega}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) &= 0, \quad i = 1, \dots, l \\ \alpha_i^* g_i(\boldsymbol{\omega}^*) &= 0, \quad i = 1, \dots, k \\ g_i(\boldsymbol{\omega}^*) &\le 0, \quad i = 1, \dots, k \\ \alpha_i^* &\ge 0, \quad i = 1, \dots, k. \end{aligned}\]
Conversely, any \(\boldsymbol{\omega}^*\), \(\boldsymbol{\alpha}^*\) and \(\boldsymbol{\beta}^*\) satisfying the KKT conditions are solutions to the primal and dual problems.
The third of the KKT conditions (dual complementarity) is particularly interesting; it is spelled out for the SVM below.
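Concretely, for the SVM constraints used in the next section, \(g_i(\boldsymbol{\omega}, b) = 1 - y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b)\), dual complementarity says
\[\alpha_i^* \gt 0 \;\Rightarrow\; g_i(\boldsymbol{\omega}^*, b^*) = 0 \;\Rightarrow\; y^{(i)}(\boldsymbol{\omega}^{*T} \boldsymbol{x}^{(i)} + b^*) = 1,\]
i.e. a non-zero multiplier can only be attached to an element whose functional margin is exactly \(1\); these are the support vectors.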
Solving SVM with Lagrange dual
For the hard-margin problem derived above,
\[\begin{aligned} \min_{\boldsymbol{\omega}, b} \quad & \frac{1}{2} ||\boldsymbol{\omega}||^2 \\ \text{s.t.} \quad & g_i(\boldsymbol{\omega}, b) = 1 - y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \le 0, \quad i = 1, \dots, m, \end{aligned}\]
the Lagrangian is
\[\mathcal{L}(\boldsymbol{\omega}, b, \boldsymbol{\alpha}) = \frac{1}{2} ||\boldsymbol{\omega}||^2 - \sum_{i = 1}^m \alpha_i \left[ y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) - 1 \right].\]
NOTE: In the final result, only the \(\alpha_i\)'s corresponding to the support vectors \(\boldsymbol{x}^{(i)}\) are non-zero.
To solve the dual form of the problem, we first minimize \(\mathcal{L}(\boldsymbol{\omega}, b, \boldsymbol{\alpha})\) with respect to \(\boldsymbol{\omega}\) and \(b\) (for fixed \(\boldsymbol{\alpha}\)) to get \(\theta_\mathcal{D}\), which is achieved by setting the derivatives of \(\mathcal{L}\) with respect to \(\boldsymbol{\omega}\) and \(b\) to zero:
\[\nabla_{\boldsymbol{\omega}} \mathcal{L}(\boldsymbol{\omega}, b, \boldsymbol{\alpha}) = \boldsymbol{\omega} - \sum_{i = 1}^m \alpha_i y^{(i)} \boldsymbol{x}^{(i)} = \boldsymbol{0},\]
which implies that
\[\boldsymbol{\omega} = \sum_{i = 1}^m \alpha_i y^{(i)} \boldsymbol{x}^{(i)}.\]
As for the derivative with respect to \(b\), we obtain
\[\frac{\partial}{\partial b} \mathcal{L}(\boldsymbol{\omega}, b, \boldsymbol{\alpha}) = \sum_{i = 1}^m \alpha_i y^{(i)} = 0.\]
Substituting these back into the Lagrangian, we get
\[\mathcal{L}(\boldsymbol{\omega}, b, \boldsymbol{\alpha}) = \sum_{i = 1}^m \alpha_i - \frac{1}{2} \sum_{i = 1}^m \sum_{j = 1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)} \rangle.\]
Finally we obtain the dual optimization problem:
\[\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & W(\boldsymbol{\alpha}) = \sum_{i = 1}^m \alpha_i - \frac{1}{2} \sum_{i = 1}^m \sum_{j = 1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)} \rangle \\ \text{s.t.} \quad & \alpha_i \ge 0, \quad i = 1, \dots, m \\ & \sum_{i = 1}^m \alpha_i y^{(i)} = 0. \end{aligned}\]
After solving for \(\boldsymbol{\omega}^*\), the intercept term \(b^*\) can be calculated as
\[b^* = -\frac{\max_{i : y^{(i)} = -1} \boldsymbol{\omega}^{*T} \boldsymbol{x}^{(i)} + \min_{i : y^{(i)} = 1} \boldsymbol{\omega}^{*T} \boldsymbol{x}^{(i)}}{2}.\]
And \(\boldsymbol{\omega}^T \boldsymbol{x} + b\) can be calculated as
\[\boldsymbol{\omega}^T \boldsymbol{x} + b = \sum_{i = 1}^m \alpha_i y^{(i)} \langle \boldsymbol{x}^{(i)}, \boldsymbol{x} \rangle + b,\]
so prediction only requires inner products between \(\boldsymbol{x}\) and the support vectors (the points with \(\alpha_i \neq 0\)).
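To tie the pieces together, here is a sketch that solves this dual with a generic QP solver and then recovers \(\boldsymbol{\omega}^*\), \(b^*\) and the decision values exactly as above. It assumes the cvxopt package is available; the function names and the \(10^{-6}\) threshold used to flag support vectors are my own choices, not anything from the text.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_fit(X, y):
    """Solve max_a sum(a) - 1/2 sum_ij y_i y_j a_i a_j <x_i, x_j>
    s.t. a_i >= 0, sum_i a_i y_i = 0, written as a cvxopt QP (a minimization).
    X: (m, n) float array, y: (m,) array of +/-1 labels."""
    m = X.shape[0]
    K = X @ X.T                                   # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)                # quadratic term
    q = matrix(-np.ones(m))                       # linear term (sign flipped for minimization)
    G = matrix(-np.eye(m))                        # -a_i <= 0  <=>  a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))    # sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.array(sol["x"]).ravel()

    # w* = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    sv = alpha > 1e-6                             # support vectors: alpha_i > 0
    # b* = -(max_{y_i=-1} w^T x_i + min_{y_i=+1} w^T x_i) / 2
    b_star = -(np.max(X[y == -1] @ w) + np.min(X[y == +1] @ w)) / 2
    return alpha, w, b_star, sv

def decision_function(alpha, b_star, X_train, y_train, X_new):
    # w^T x + b = sum_i alpha_i y_i <x_i, x> + b; only support vectors contribute.
    return (alpha * y_train) @ (X_train @ X_new.T) + b_star
```

On the toy separable data from the earlier sketch, `svm_dual_fit(X, y.astype(float))` should recover, up to numerical tolerance, the same \(\boldsymbol{\omega}\) and \(b\) as the near-hard-margin fit above.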
Kernel trick
The kernel is defined as
\[K(\boldsymbol{x}, \boldsymbol{z}) = \phi(\boldsymbol{x})^T \phi(\boldsymbol{z}),\]
where the feature mapping \(\phi(\boldsymbol{x})\) can be very complicated and high-dimensional, while \(K(\boldsymbol{x}, \boldsymbol{z})\) is computationally cheap. With the kernel trick, we can map the elements into a higher-dimensional space with little extra computation by avoiding the explicit computation of \(\phi(\boldsymbol{x})\).
A kernel is valid if and only if its kernel (Gram) matrix is symmetric and positive semi-definite for every finite set of points (Mercer's theorem).
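As a NumPy-only illustration (the quadratic feature map and kernel below are standard examples, chosen here just for brevity), this sketch checks that \(K(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x}^T \boldsymbol{z})^2\) equals \(\phi(\boldsymbol{x})^T \phi(\boldsymbol{z})\) for the explicit map of all pairwise products, and that the resulting kernel matrix is symmetric positive semi-definite:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all pairwise products x_i * x_j.
    For x in R^n this lives in R^(n^2); computing it costs O(n^2)."""
    return np.outer(x, x).ravel()

def K(x, z):
    """Quadratic kernel K(x, z) = (x^T z)^2, computed in O(n)."""
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(K(x, z), phi(x) @ phi(z)))       # True: same value, cheaper to compute

# The kernel (Gram) matrix on a sample is symmetric positive semi-definite.
X = rng.normal(size=(10, 5))
G = (X @ X.T) ** 2
print(np.allclose(G, G.T))                        # symmetric
print(np.min(np.linalg.eigvalsh(G)) >= -1e-9)     # eigenvalues >= 0 (up to round-off)
```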
Non-separable case
With \(\mathit{l}_1\) regularization (on the slack variables), we have
\[\begin{aligned} \min_{\boldsymbol{\omega}, b, \boldsymbol{\xi}} \quad & \frac{1}{2} ||\boldsymbol{\omega}||^2 + C \sum_{i = 1}^m \xi_i \\ \text{s.t.} \quad & y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \dots, m \\ & \xi_i \ge 0, \quad i = 1, \dots, m. \end{aligned}\]
Now the elements are permitted to have (functional) margin less than \(1\); if an example has functional margin \(1 - \xi_i\), the objective function pays a cost of \(C\xi_i\). The parameter \(C\) controls the relative weighting between the twin goals of making \(||\boldsymbol{\omega}||^2\) small and ensuring that most elements have functional margin at least \(1\).
The dual form is
\[\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & W(\boldsymbol{\alpha}) = \sum_{i = 1}^m \alpha_i - \frac{1}{2} \sum_{i = 1}^m \sum_{j = 1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)} \rangle \\ \text{s.t.} \quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, m \\ & \sum_{i = 1}^m \alpha_i y^{(i)} = 0, \end{aligned}\]
and the KKT dual-complementarity conditions are:
\[\begin{aligned} \alpha_i = 0 &\;\Rightarrow\; y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \ge 1 \\ \alpha_i = C &\;\Rightarrow\; y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \le 1 \\ 0 \lt \alpha_i \lt C &\;\Rightarrow\; y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) = 1. \end{aligned}\]
The calculation of \(b^*\) given above is no longer valid here.
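One common recipe, which follows from the third dual-complementarity condition above (any \(i\) with \(0 \lt \alpha_i \lt C\) has functional margin exactly \(1\)), is to average over the margin support vectors; the set \(\mathcal{M}\) below is notation introduced just for this formula:
\[b^* = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left( y^{(i)} - \sum_{j = 1}^m \alpha_j y^{(j)} \langle \boldsymbol{x}^{(j)}, \boldsymbol{x}^{(i)} \rangle \right), \qquad \mathcal{M} = \{ i : 0 \lt \alpha_i \lt C \}.\]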
The SMO Algorithm
Before discussing the SMO (sequential minimal optimization) algorithm, we first introduce coordinate ascent.
For an unconstrained optimization problem
\[\max_{\boldsymbol{\alpha}} W(\alpha_1, \alpha_2, \dots, \alpha_m),\]
coordinate ascent does the following: repeat until convergence, and for each \(i = 1, \dots, m\), update
\[\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \alpha_{i - 1}, \hat{\alpha}_i, \alpha_{i + 1}, \dots, \alpha_m),\]
holding all the other \(\alpha_j\)'s fixed.
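Below is a minimal sketch of coordinate ascent on a toy concave quadratic objective; the objective \(W(\boldsymbol{\alpha}) = -\frac{1}{2}\boldsymbol{\alpha}^T Q \boldsymbol{\alpha} + \boldsymbol{p}^T \boldsymbol{\alpha}\) and its closed-form coordinate update are illustrative choices of mine, not part of the SVM derivation.

```python
import numpy as np

def coordinate_ascent(Q, p, n_iters=100):
    """Maximize W(a) = -0.5 a^T Q a + p^T a (Q symmetric positive definite)
    by repeatedly maximizing over one coordinate a_i while fixing the others."""
    m = len(p)
    a = np.zeros(m)
    for _ in range(n_iters):
        for i in range(m):
            # Exact maximizer over a_i with the other coordinates held fixed:
            # dW/da_i = p_i - (Q a)_i = 0, with a_i's own term separated out.
            a[i] = (p[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]
    return a

# Usage: the maximizer of W is Q^{-1} p, which coordinate ascent converges to.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
Q = M @ M.T + 4 * np.eye(4)        # symmetric positive definite
p = rng.normal(size=4)
a = coordinate_ascent(Q, p)
print(np.allclose(a, np.linalg.solve(Q, p), atol=1e-6))   # True
```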
In the SVM dual problem, however, the \(\alpha_i\)'s must satisfy the constraints \(0 \le \alpha_i \le C\) and \(\sum_{i = 1}^m \alpha_i y^{(i)} = 0\). We can't update a single \(\alpha_i\) at each step because, with the other \(\alpha_j\)'s fixed, the equality constraint gives
\[\alpha_i y^{(i)} = -\sum_{j \ne i} \alpha_j y^{(j)},\]
so \(\alpha_i\) is completely determined by the other \(\alpha_j\)'s. Thus SMO chooses to update two of the \(\alpha_i\)'s in each step, such that the constraints are not violated. Fixing all but \(\alpha_i\) and \(\alpha_j\), we have
\[\alpha_i y^{(i)} + \alpha_j y^{(j)} = \zeta\]
for some constant \(\zeta\), which gives us
\[\alpha_i = (\zeta - \alpha_j y^{(j)}) y^{(i)}.\]
Substituting this into \(W\), the objective function turns out to be a quadratic function of \(\alpha_j\) alone, which can be maximized in closed form. Writing \(\alpha_j^{new, unclipped}\) for this unconstrained maximizer, and letting \([L, H]\) be the interval of values of \(\alpha_j\) allowed by \(0 \le \alpha_i, \alpha_j \le C\) together with the line constraint above, \(\alpha_j\) is updated with
\[\alpha_j^{new} = \begin{cases} H & \text{if } \alpha_j^{new, unclipped} \gt H \\ \alpha_j^{new, unclipped} & \text{if } L \le \alpha_j^{new, unclipped} \le H \\ L & \text{if } \alpha_j^{new, unclipped} \lt L. \end{cases}\]
After finding the new value of \(\alpha_j\), we can then get the new value of \(\alpha_i\) with equation \(\alpha_i = (\zeta - \alpha_j y^{(j)})y^{(i)}\).
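Below is a sketch of a single \((i, j)\) update consistent with the box constraint \(0 \le \alpha_i \le C\) and the line constraint above, assuming a precomputed kernel matrix \(K\). The pair-selection heuristic, error caching, and convergence test of a full SMO implementation are omitted, and all function and variable names are my own.

```python
import numpy as np

def f(k, alpha, b, y, K):
    """Decision value for training point k: sum_i alpha_i y_i K(x_i, x_k) + b."""
    return (alpha * y) @ K[:, k] + b

def smo_pair_update(i, j, alpha, b, y, K, C, eps=1e-5):
    """Jointly re-optimize (alpha_i, alpha_j), keeping alpha_i y_i + alpha_j y_j fixed
    and keeping both inside [0, C]. Returns the updated (alpha, b)."""
    if i == j:
        return alpha, b
    E_i = f(i, alpha, b, y, K) - y[i]
    E_j = f(j, alpha, b, y, K) - y[j]

    # Box [L, H] for alpha_j implied by 0 <= alpha <= C and the line constraint.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if H - L < eps:
        return alpha, b

    eta = 2 * K[i, j] - K[i, i] - K[j, j]   # second derivative along the line (normally < 0)
    if eta >= 0:
        return alpha, b

    # Unconstrained maximizer along the line, then clip to [L, H].
    a_j_new = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
    a_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - a_j_new)

    # Update b so the KKT conditions hold for the updated pair.
    b1 = b - E_i - y[i] * (a_i_new - alpha[i]) * K[i, i] - y[j] * (a_j_new - alpha[j]) * K[i, j]
    b2 = b - E_j - y[i] * (a_i_new - alpha[i]) * K[i, j] - y[j] * (a_j_new - alpha[j]) * K[j, j]
    if 0 < a_i_new < C:
        b = b1
    elif 0 < a_j_new < C:
        b = b2
    else:
        b = (b1 + b2) / 2

    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i_new, a_j_new
    return alpha, b
```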
Recent method for Hard Margin Linear SVM
The loss function for each element in the hard-margin linear SVM is defined as the hinge loss:
\[\ell^{(i)}(\boldsymbol{\omega}, b) = \max\left(0,\ 1 - y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b)\right),\]
and its sub-gradient with respect to \(\boldsymbol{\omega}\) is
\[\partial_{\boldsymbol{\omega}} \ell^{(i)}(\boldsymbol{\omega}, b) = \begin{cases} -y^{(i)} \boldsymbol{x}^{(i)} & \text{if } y^{(i)}(\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b) \lt 1 \\ \boldsymbol{0} & \text{otherwise} \end{cases}\]
(and similarly \(-y^{(i)}\) or \(0\) with respect to \(b\)). The objective is to minimize the total loss
\[\min_{\boldsymbol{\omega}, b} \sum_{i = 1}^m \ell^{(i)}(\boldsymbol{\omega}, b),\]
which is a convex, piecewise-linear problem, and thus can be solved by sub-gradient methods such as SGD, or by L-BFGS.
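A minimal sub-gradient descent (SGD) sketch of this objective in plain NumPy; the learning rate, epoch count, and toy data are arbitrary illustrative choices of mine.

```python
import numpy as np

def hinge_sgd(X, y, lr=0.01, n_epochs=100):
    """Minimize sum_i max(0, 1 - y_i (w^T x_i + b)) by stochastic sub-gradient descent.
    The sub-gradient of one term is (-y_i x_i, -y_i) when the margin is < 1, else 0."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            if y[i] * (X[i] @ w + b) < 1:     # margin violated: nonzero sub-gradient
                w += lr * y[i] * X[i]
                b += lr * y[i]
    return w, b

# Usage on toy separable data with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)
w, b = hinge_sgd(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```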