Given i.i.d. data $(x_1, y_1), \dots, (x_n, y_n)$ drawn from some distribution, with $x_i \in \mathbb{R}^d$ and labels $y_i \in \{-1, +1\}$, consider the hypothesis set of linear classifiers $h(x) = \operatorname{sign}(w^\top x + b)$. We begin by searching for the optimal hyperplane, namely the one with maximum margin $\gamma(w, b)$, which geometrically maximizes robustness to perturbations (moving a point a little bit won't flip its prediction):

$$\max_{w, b}\ \gamma(w, b) \;=\; \max_{w, b}\ \min_{i}\ \frac{y_i (w^\top x_i + b)}{\|w\|_2} \;=\; \max_{w, b \,:\, \min_i y_i (w^\top x_i + b) = 1}\ \frac{1}{\|w\|_2},$$

where the second equality holds by scale invariance of the quotient: rescaling $(w, b) \mapsto (c\,w, c\,b)$ with $c > 0$ leaves each quotient unchanged, so we may normalize $\min_i y_i (w^\top x_i + b) = 1$. This gives the following constrained optimization problem:

$$\min_{w, b}\ \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n.$$
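This hard-margin problem is a small quadratic program, so any QP solver can handle it. Below is a minimal sketch using the `cvxpy` modeling library; the helper name `hard_margin_svm` and the toy dataset are our own illustrative choices, and the code assumes labels in $\{-1, +1\}$ and linearly separable data (otherwise the problem is infeasible).

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve  min 1/2 ||w||^2  s.t.  y_i (w . x_i + b) >= 1  (hypothetical helper)."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    problem = cp.Problem(
        cp.Minimize(0.5 * cp.sum_squares(w)),
        [cp.multiply(y, X @ w + b) >= 1],  # one margin constraint per point
    )
    problem.solve()
    return w.value, b.value

# Toy separable data: two points per class on either side of x1 = 0.
X = np.array([[2.0, 1.0], [3.0, -1.0], [-2.0, 1.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b)  # hyperplane w . x + b = 0 with margin 1 / ||w||
```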
In reality, the data is often not linearly separable, so one introduces a slack variable $\xi_i \ge 0$ for each data point, constructed in the following way:
1. Remove all points that make the data non-separable.
2. Form the maximum-margin hyperplane for this now-separable dataset.
3. Define $\xi_i$ to be the distance from $x_i$ to the closest correctly classified point! (A sketch of the resulting slack computation follows this list.)
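For a fixed hyperplane $(w, b)$, the slack that enters the optimization below takes the functional-margin form $\xi_i = \max(0,\, 1 - y_i (w^\top x_i + b))$; dividing by $\|w\|_2$ would convert it into a geometric distance. A minimal sketch, with the hypothetical helper name `slack_variables`:

```python
import numpy as np

def slack_variables(X, y, w, b):
    """Functional-margin slacks  xi_i = max(0, 1 - y_i (w . x_i + b)):
    zero for points on the correct side of the margin, positive in
    proportion to how far x_i violates it (hypothetical helper)."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))
```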
This construction gives the following constrained optimization problem:

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n,$$

where $C > 0$ captures the tradeoff between penalizing non-separable points (outliers) and maximizing the margin. Since at the optimum $\xi_i = \max(0,\, 1 - y_i (w^\top x_i + b))$, this is equivalent to minimization of the $\ell_2$-regularized hinge loss

$$\min_{w, b}\ \sum_{i=1}^{n} \max\big(0,\, 1 - y_i (w^\top x_i + b)\big) + \lambda \|w\|_2^2,$$

where $\lambda = \frac{1}{2C}$.
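Below is a minimal NumPy sketch of minimizing this $\ell_2$-regularized hinge loss (averaged over the $n$ points) by full-batch subgradient descent. The helper name `hinge_svm`, the step size, and the epoch count are our own illustrative choices; in practice one would use a decaying step size (as in Pegasos) or an off-the-shelf solver such as `sklearn.svm.LinearSVC`.

```python
import numpy as np

def hinge_svm(X, y, lam=0.1, lr=0.01, epochs=500):
    """Full-batch subgradient descent on
        (1/n) * sum_i max(0, 1 - y_i (w . x_i + b)) + lam * ||w||^2,
    the l2-regularized hinge loss (bias b left unregularized)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points with nonzero hinge loss
        # Subgradient: -y_i x_i for each margin violator, plus the l2 term.
        grad_w = 2.0 * lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```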