Machine Learning Statistics

Given i.i.d. data $(x_1, y_1), \dots, (x_n, y_n)$ of $d$-dimensional features $x_i \in \mathbb{R}^d$ and a hypothesis set $\mathcal{H}$, we seek a function $h \in \mathcal{H}$ that minimizes the risk functional $R(h) = \mathbb{E}\,\ell(h(x), y)$. The empirical risk is the sample average of the loss function,

$$\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i),$$

where for classification $\ell(h(x), y) = \mathbb{1}[h(x) \neq y]$. Since the data are i.i.d., $\mathbb{E}\,\hat{R}_n(h) = R(h)$. Therefore, for any fixed $h$, by the Law of Large Numbers $\hat{R}_n(h) \to R(h)$ almost surely, with a rate of convergence given by Hoeffding's Inequality:

$$\mathbb{P}\big(|\hat{R}_n(h) - R(h)| > \varepsilon\big) \leq 2 e^{-2 n \varepsilon^2}.$$

So only with probability exponentially small in the number of samples does the empirical risk deviate from the true risk by more than $\varepsilon$. Therefore we can use Empirical Risk Minimization (ERM),

$$\hat{h}_n = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h),$$

to get a consistent estimator of the minimizer of the true risk $h^\star = \arg\min_{h \in \mathcal{H}} R(h)$. For finite hypothesis spaces, a union bound gives the generalization bound

$$\mathbb{P}\Big(\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \varepsilon\Big) \leq 2 |\mathcal{H}|\, e^{-2 n \varepsilon^2}.$$

We see that for ERM to succeed for general, infinite hypothesis spaces, one needs a sort of uniform LLN over $\sigma(\mathcal{H})$, the sigma-algebra generated by the hypothesis set. This was established using various measures of complexity for $\mathcal{H}$. The common theme is the reduction of the analysis of an infinite class to an analysis of finite classes:

  1. VC Dimension - for classification, i.e. binary losses $\ell(h(x), y) = \mathbb{1}[h(x) \neq y]$
  2. Rademacher Complexity - for regression, i.e. real-valued losses $\ell(h(x), y) \in \mathbb{R}$

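The finite-class case can be checked numerically. The sketch below is illustrative only: the data distribution (uniform features, 10% label noise) and the finite class of threshold classifiers are my own assumptions, chosen so the true risk is computable in closed form. It compares the worst empirical deviation $\sup_{h} |\hat{R}_n(h) - R(h)|$ over the class against the deviation level $\varepsilon$ implied by the Hoeffding/union bound at confidence $1 - \delta$.

```python
# Monte Carlo sketch of the finite-class generalization bound.
# Distribution and hypothesis class are illustrative assumptions.
import math
import random

random.seed(0)

def sample(n):
    # Clean label is 1[x > 0.5]; each label flips with probability 0.1.
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

# Finite hypothesis set: threshold classifiers h_t(x) = 1[x > t].
thresholds = [i / 20 for i in range(21)]

def emp_risk(t, data):
    # Empirical 0-1 risk of h_t on the sample.
    return sum(int(x > t) != y for x, y in data) / len(data)

def true_risk(t):
    # Exact 0-1 risk under the distribution above: h_t disagrees with the
    # clean rule on an interval of mass |t - 0.5|, and noise flips 10%.
    m = abs(t - 0.5)
    return 0.9 * m + 0.1 * (1 - m)

n = 2000
data = sample(n)
gap = max(abs(emp_risk(t, data) - true_risk(t)) for t in thresholds)

# Deviation level from P(sup |R_hat - R| > eps) <= 2|H| exp(-2 n eps^2) = delta.
delta = 0.05
eps = math.sqrt(math.log(2 * len(thresholds) / delta) / (2 * n))
print(f"max |R_hat - R| = {gap:.3f}  vs  union-bound eps = {eps:.3f}")
```

Typically the observed worst-case gap sits well below the union-bound level, reflecting that the bound is loose for highly correlated hypotheses.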
The two key results of the theory of ERM are

  1. Necessary and sufficient conditions on $\mathcal{H}$ under which uniform convergence is guaranteed to hold
  2. Rates of convergence
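As a concrete instance of such a complexity measure, the empirical Rademacher complexity $\hat{\mathfrak{R}}_n(\mathcal{H}) = \mathbb{E}_\sigma \big[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_i \sigma_i h(x_i)\big]$, with i.i.d. signs $\sigma_i \in \{\pm 1\}$, can be estimated by Monte Carlo. The threshold class and sample below are again my own illustrative assumptions, not from the text.

```python
# Monte Carlo estimate of the empirical Rademacher complexity of the
# threshold class h_t(x) = 1[x > t]. Illustrative sketch.
import random

random.seed(1)

def empirical_rademacher(xs, thresholds, trials=200):
    # Average over random sign vectors of sup_h (1/n) sum_i sigma_i h(x_i).
    n = len(xs)
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        best = max(
            sum(s * int(x > t) for s, x in zip(sigma, xs)) / n
            for t in thresholds
        )
        total += best
    return total / trials

xs = [random.random() for _ in range(200)]
thresholds = [i / 50 for i in range(51)]
rad = empirical_rademacher(xs, thresholds)
print(f"estimated Rademacher complexity: {rad:.3f}")

# For a finite class, Massart's lemma bounds this by sqrt(2 ln|H| / n),
# which here evaluates to about 0.198; the estimate should come in below.
```

A Rademacher complexity of order $O(\sqrt{\log n / n})$ for such a class is exactly the kind of rate the theory delivers: it controls the uniform deviation $\sup_h |\hat{R}_n(h) - R(h)|$ and hence the excess risk of ERM.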