Why Does Supervised Learning Work?

Machine learning algorithms such as logistic regression and XGBoost are widely used across industries. Practitioners typically focus on whether a technique is effective, with little interest in why it succeeds or fails, often taking a brute-force approach to finding the best features and algorithms for their out-of-time validation. In this blog, I will explain why supervised learning works. The statistical learning theory presented here is not my original work, but I hope that sharing it with a wider audience contributes to the dissemination of this knowledge.

Generalization Error Bound in Supervised Learning

In essence, "Low Training Error + Sufficient Data Relative to Model Complexity → Low Test Error".

A key assumption: supervised learning comes with guarantees only when training and test data are drawn from the SAME distribution.

Key Concepts

  • Training Error (\(\text{Train}_S(f)\)): The error of a model \(f\) evaluated on the training dataset \(S\).
  • Test Error (\(\text{Test}_D(f)\)): The expected error of \(f\) on new, unseen data drawn from the same distribution \(D\).
  • Hypothesis Class (\(\mathcal{F}\)): The set of all models \(f\) that the learning algorithm can choose from.
  • Sample Size (\(|S|\)): The number of training examples.
  • Confidence Level (\(1 - \delta\)): The probability that the bound holds true.
  • Degrees of Freedom: Informally, the number of parameters or complexity of the model class \(\mathcal{F}\).

Mathematical Formulation

Generalization Error Bound

The generalization error bound, stated here with constants simplified for readability (the proof below derives the exact form), is given by:

\[ \Pr_{S \sim D^{|S|}} \left[ \text{Test}_D(f) - \text{Train}_S(f) \leq \sqrt{\frac{\log |\mathcal{F}| + \log \frac{1}{\delta}}{|S|}} \quad \text{for all} \quad f \in \mathcal{F} \right] > 1 - \delta \]

Interpretation:

With probability at least \(1 - \delta\), the difference between test and training error is bounded by:

\[ \sqrt{\frac{\log |\mathcal{F}| + \log \frac{1}{\delta}}{|S|}} \]

Implications:

  1. Sample Size Effect
    • Larger training sample size \(|S|\) results in:
      • Tighter error bounds
      • Test error closer to training error
  2. Model Complexity Effect
    • Larger hypothesis class size \(|\mathcal{F}|\) leads to:
      • Looser error bounds
      • Potentially larger gap between test and training errors
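To make the two effects concrete, here is a minimal sketch that evaluates the simplified bound above for a range of illustrative values (the hypothesis-class sizes and sample sizes are arbitrary choices, and \(\log\) is taken as the natural logarithm):

```python
import math

def gap_bound(num_hypotheses, sample_size, delta=0.05):
    """sqrt((log|F| + log(1/delta)) / |S|), interpreting log as natural log."""
    return math.sqrt((math.log(num_hypotheses) + math.log(1 / delta)) / sample_size)

# Sample size effect: more data tightens the bound.
for n in (100, 1_000, 10_000):
    print(f"|S| = {n:>6}: gap <= {gap_bound(10**6, n):.3f}")

# Model complexity effect: a larger hypothesis class loosens the bound.
for f in (10**3, 10**6, 10**9):
    print(f"|F| = 10^{round(math.log10(f))}: gap <= {gap_bound(f, 1_000):.3f}")
```

Holding \(\delta\) fixed, growing \(|S|\) shrinks the bound at a \(1/\sqrt{|S|}\) rate, while growing \(|\mathcal{F}|\) only loosens it logarithmically.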

Proof

The proof follows these key steps:

  1. Setup
    • Let \(f \in \mathcal{F}\) be any hypothesis
    • Let \(S\) be a training set of size \(|S|\) drawn i.i.d. from distribution \(D\)
    • Let \(\text{Test}_D(f)\) and \(\text{Train}_S(f)\) be the test and training errors
  2. Hoeffding's Inequality

    For a single hypothesis \(f\):

    \[ \Pr_S[|\text{Test}_D(f) - \text{Train}_S(f)| > \epsilon] \leq 2e^{-2|S|\epsilon^2} \]

    Note on Hoeffding's Inequality: In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of independent random variables deviates from its expected value.
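    As a sanity check, the inequality can be verified empirically for a single hypothesis. The sketch below (the true error rate, sample size, and deviation threshold are illustrative values) simulates many i.i.d. training sets and compares the observed deviation frequency to the Hoeffding bound:

```python
import math
import random

random.seed(0)
p, n, eps, trials = 0.3, 200, 0.08, 5_000  # true error, |S|, deviation, repeats

# Frequency of |Train_S(f) - Test_D(f)| > eps across many i.i.d. training sets.
exceed = 0
for _ in range(trials):
    train_err = sum(random.random() < p for _ in range(n)) / n
    exceed += abs(train_err - p) > eps
empirical = exceed / trials

hoeffding = 2 * math.exp(-2 * n * eps ** 2)  # right-hand side of the inequality
print(f"empirical {empirical:.4f} <= Hoeffding bound {hoeffding:.4f}")
```

    The empirical frequency sits below the bound, as expected; Hoeffding is distribution-free, so it is typically loose for any particular distribution.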

  3. Union Bound

    Apply to all hypotheses \(f \in \mathcal{F}\):

    \[ \Pr_S[\exists f \in \mathcal{F}: |\text{Test}_D(f) - \text{Train}_S(f)| > \epsilon] \leq |\mathcal{F}| \cdot 2e^{-2|S|\epsilon^2} \]

  4. Set Failure Probability

    Set the right-hand side equal to \(\delta\):

    \[ |\mathcal{F}| \cdot 2e^{-2|S|\epsilon^2} = \delta \]

  5. Solve for \(\epsilon\)

    \[ \begin{align*} e^{-2|S|\epsilon^2} &= \frac{\delta}{2|\mathcal{F}|} \\ -2|S|\epsilon^2 &= \ln(\frac{\delta}{2|\mathcal{F}|}) \\ \epsilon^2 &= \frac{\ln(2|\mathcal{F}|/\delta)}{2|S|} \\ \epsilon &= \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2|S|}} \end{align*} \]
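    Step 5 can be checked numerically: plugging the derived \(\epsilon\) back into the union-bound expression should recover \(\delta\) exactly. A minimal sketch with illustrative values for \(|\mathcal{F}|\), \(|S|\), and \(\delta\):

```python
import math

F_size, S_size, delta = 1_000, 5_000, 0.05  # illustrative values

# Epsilon from step 5 of the derivation.
eps = math.sqrt(math.log(2 * F_size / delta) / (2 * S_size))

# Substituting eps back into the union-bound expression recovers delta.
failure_prob = F_size * 2 * math.exp(-2 * S_size * eps ** 2)
print(f"eps = {eps:.4f}, failure probability = {failure_prob:.6f}")
```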

  6. Final Result

    Since \(\Pr_S[\exists f \in \mathcal{F}: |\text{Test}_D(f) - \text{Train}_S(f)| > \epsilon] \leq \delta\), it follows that the probability that all hypotheses satisfy the inequality is at least \(1-\delta\): \(\Pr_S[\forall f \in \mathcal{F}: |\text{Test}_D(f) - \text{Train}_S(f)| \leq \epsilon] \geq 1 - \delta\).

    Thus, with probability at least \(1-\delta\), for every \(f \in \mathcal{F}\):

    \[ |\text{Test}_D(f) - \text{Train}_S(f)| \leq \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2|S|}} \]

    Since \(\ln 2 \leq \ln |\mathcal{F}| + \ln \frac{1}{\delta}\) whenever \(|\mathcal{F}| \geq 2\), this is at most the simpler bound stated earlier:

    \[ |\text{Test}_D(f) - \text{Train}_S(f)| \leq \sqrt{\frac{\log |\mathcal{F}| + \log \frac{1}{\delta}}{|S|}} \]

This proves that with high probability, the difference between test and training error is bounded by a term that depends on the model complexity (\(|\mathcal{F}|\)) and training set size (\(|S|\)).

In other words, these guarantees hold only when training and test data are drawn from the SAME distribution; that is why supervised learning works, and also when it can fail.