Given i.i.d. data $\{(x_i, y_i)\}_{i=1}^n$ drawn from some distribution, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, consider the hypothesis set of linear classifiers $f(x) = \operatorname{sign}(w^\top x + b)$. We begin by searching for the optimal hyperplane with maximum margin, which geometrically maximizes robustness to errors (moving a point a little will not flip its prediction):

$$\max_{w, b} \; \min_i \frac{y_i (w^\top x_i + b)}{\|w\|} \;=\; \max_{w, b \,:\, \min_i y_i (w^\top x_i + b) = 1} \; \frac{1}{\|w\|},$$

where the second equality holds by scale invariance of the quotient. This gives the following constrained optimization problem:

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 \;\; \text{for all } i.$$

In reality, the data is not linearly separable, so one introduces a slack variable $\xi_i \ge 0$ for each data point, in the following way:

  1. Remove all points that make the data non-separable
  2. Form the maximum margin hyperplane for this separable dataset
  3. Define $\xi_i$ to be the distance from $x_i$ to the closest correctly classified point!

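The role the slack plays can be seen numerically: for a fixed hyperplane, the slack a point needs so that the shifted constraint $y_i(w^\top x_i + b) \ge 1 - \xi_i$ holds is $\max(0,\, 1 - y_i(w^\top x_i + b))$. A minimal NumPy sketch (the hyperplane and toy data here are made up for illustration):

```python
import numpy as np

def slacks(X, y, w, b):
    """Slack xi_i = max(0, 1 - y_i (w . x_i + b)) for each point.

    xi_i = 0       : the point satisfies the hard-margin constraint;
    0 < xi_i <= 1  : inside the margin but still correctly classified;
    xi_i > 1       : misclassified by the hyperplane (w, b).
    """
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

# Toy 2-D data: one point deliberately on the wrong side of the hyperplane.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [1.0, 0.5]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

xi = slacks(X, y, w, b)
# The last point (label -1 but w.x + b > 0) needs slack greater than 1.
```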
This gives the following constrained optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \text{for all } i,$$

where $C > 0$ captures the tradeoff between penalizing non-separable points (outliers) and maximizing the margin. This is equivalent to minimization of the $\ell_2$-regularized hinge loss

$$\min_{w, b} \; \sum_{i=1}^n \max\big(0,\, 1 - y_i (w^\top x_i + b)\big) + \lambda \|w\|^2,$$

where $\lambda = \frac{1}{2C}$.
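Since the regularized hinge loss is an unconstrained objective, it can be minimized directly, e.g. by subgradient descent. A minimal sketch in NumPy (the toy data, step size, and iteration count are arbitrary illustrative choices, not from the text):

```python
import numpy as np

def hinge_objective(X, y, w, b, lam):
    """Sum of hinge losses plus lam * ||w||^2."""
    losses = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return losses.sum() + lam * (w @ w)

def train_svm(X, y, lam=0.1, lr=0.01, iters=2000):
    """Subgradient descent on the l2-regularized hinge loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1.0  # points with a nonzero hinge subgradient
        grad_w = -(X[active] * y[active][:, None]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-2.0, -1.0], [-1.0, -2.0], [-3.0, 0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])
w, b = train_svm(X, y)
preds = np.sign(X @ w + b)
```

On separable data like this, the learned hyperplane should classify every training point correctly, and the objective ends well below its value at $w = 0$, $b = 0$.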

So the Lagrangian for the SVM problem is

$$\mathcal{L}(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 + \xi_i \big) - \sum_{i=1}^n \mu_i \xi_i,$$

with dual variables $\alpha_i \ge 0$ and $\mu_i \ge 0$.
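As a sketch of where this Lagrangian leads (the standard first step toward the dual problem, not spelled out in the text above), setting its gradients to zero gives

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^n \alpha_i y_i = 0, \qquad \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C - \mu_i,$$

and since $\mu_i \ge 0$, the last condition yields the box constraint $0 \le \alpha_i \le C$ on the dual variables.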