CISC 484/684 – Introduction to Machine Learning
June 12, 2025
A set of methods that can automatically detect patterns in data.
Use the uncovered patterns to predict future data.
Core components: training data, a hypothesis class, a loss function, and a learning algorithm that minimizes that loss.
Suppose you’re building a system to predict student exam scores based on how many hours they studied. This is a classic example of a supervised learning problem.
We collect training data in the form of:
\[ \{(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \dots,\ (x^{(n)}, y^{(n)})\} \]
Our goal is to learn a function that maps new values of \(x\) (study time) to predicted scores \(y\).
Given data:
\[ \{(x^{(i)}, y^{(i)})\}_{i=1}^n \]
we want to learn a function \(h(x)\) such that:
\[ h(x) \approx y \]
We call the function we learn the hypothesis, \(h\).
The ultimate goal is not to memorize \(y\), but to generalize well to new (unseen) \(\mathbf{x}\).
In supervised learning, we often assume:
\[ y = f(x) + \epsilon \]
where \(f\) is the unknown target function and \(\epsilon\) is zero-mean noise, so that:
\[ y = f(x) + \epsilon \quad \Rightarrow \quad \mathbb{E}[y \mid x] = f(x) \]
The best we can hope for is a hypothesis that approximates the true function: \[ h(x) \approx f(x) \]
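To make this concrete, here is a minimal sketch (not from the lecture; the numbers, the hypothetical "true" scoring function, and the use of NumPy are all illustrative assumptions): we simulate study hours \(x\), generate scores \(y = f(x) + \epsilon\) with Gaussian noise, and fit a linear hypothesis \(h\) by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" relationship: each study hour adds ~8 points on top of a base of 40.
def f(x):
    return 40 + 8 * x

n = 50
x = rng.uniform(0, 6, size=n)           # hours studied
y = f(x) + rng.normal(0, 5, size=n)     # observed scores: y = f(x) + noise

# Fit a linear hypothesis h(x) = w1*x + w0 by least squares.
w1, w0 = np.polyfit(x, y, deg=1)

def h(x_new):
    return w1 * x_new + w0

print(f"learned hypothesis: h(x) = {w1:.2f}*x + {w0:.2f}")
print("predicted score for 4 hours of study:", round(h(4.0), 1))
```

With enough data the learned slope and intercept land close to the hypothetical true relationship, which is exactly the sense in which \(h(x) \approx f(x)\).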
A hypothesis class is the set of functions that a learning algorithm can search over.
It defines the space of possible models we allow during training.
Example (linear regression):
The hypothesis class is all linear functions of the form:
\[
h(x) = w^T x
\]
A rich hypothesis class is highly expressive, but may overfit the training data.
A simple hypothesis class is easier to learn and generalize from, but may underfit the true pattern.
The goal is to choose a class that is expressive enough to capture the underlying pattern, yet simple enough to learn from the data we have and still generalize (see the sketch below).
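A quick way to see this trade-off is to fit polynomial hypothesis classes of increasing degree to the same small noisy sample. This is an illustrative sketch only; the target function, noise level, and degrees are made-up choices, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)   # small noisy sample

x_grid = np.linspace(0, 1, 200)
f_grid = np.sin(2 * np.pi * x_grid)                            # the underlying function

for degree in (1, 3, 9):                                       # simple vs. richer classes
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_mse = np.mean((np.polyval(coeffs, x_grid) - f_grid) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, error vs. true f {true_mse:.3f}")
```

The richest class drives training error lowest while its error against the true function is typically worse than that of a moderate-degree fit.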
A loss function quantifies how “bad” a prediction is — it measures the discrepancy between the predicted output \(h(x)\) and the true label \(y\).
Formally, it’s a function: \[ \ell(h(x), y) \in \mathbb{R}_{\geq 0} \] where smaller values indicate better predictions.
It gives the learning algorithm a goal: minimize the total (or average) loss over the training data.
Without a loss function, we wouldn’t have a way to compare different hypotheses \(h\).
Different tasks (regression, classification) use different loss functions tailored to the nature of the output and the goal.
The 0-1 loss measures whether a prediction is correct or not:
\[ \ell_{0-1}(h(x), y) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } h(x) = y \\ 1 & \text{if } h(x) \neq y \end{cases} \]
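As a small illustration (the label arrays below are made up), the 0-1 loss can be computed elementwise and averaged to get the misclassification rate:

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 0 where the prediction matches the label, 1 where it does not."""
    return (y_pred != y_true).astype(float)

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

losses = zero_one_loss(y_pred, y_true)
print(losses)                        # [0. 1. 0. 1. 0.]
print("error rate:", losses.mean())  # average 0-1 loss = misclassification rate (0.4)
```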
The squared loss measures the squared difference between the prediction and the true label:
\[ \ell_{\text{sq}}(h(x), y) = (h(x) - y)^2 \]
Assumes that errors (noise) are zero-mean and symmetric — often modeled as Gaussian
Leads to the mean squared error (MSE) objective: \[ \frac{1}{n} \sum_{i=1}^n (h(x^{(i)}) - y^{(i)})^2 \]
The optimal predictor under squared loss is the conditional mean: \[ h^*(x) = \mathbb{E}[y \mid x] \]
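A quick numeric sanity check of that last claim (the \(y\) values below are arbitrary): among constant predictions for labels observed at a single \(x\), the one minimizing the average squared loss is the sample mean.

```python
import numpy as np

y = np.array([2.0, 3.0, 3.0, 10.0])      # labels observed at one fixed x (made up)

def avg_sq_loss(c, y):
    """Average squared loss of the constant prediction h(x) = c."""
    return np.mean((c - y) ** 2)

candidates = np.linspace(0, 12, 1201)    # grid of candidate constant predictions
best = candidates[np.argmin([avg_sq_loss(c, y) for c in candidates])]
print("best constant prediction:", round(best, 2))   # 4.5
print("mean of the labels:      ", y.mean())          # 4.5
```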
Once we define a loss function \(\ell(h(x), y)\), we can define the total “badness” of a predictor \(h\).
The expected (generalization) risk is the average loss over the true data distribution \(p(x, y)\):
\[ R(h) = \mathbb{E}_{(x, y) \sim p}[\ell(h(x), y)] \]
But we don’t know the true distribution \(p(x, y)\)!
Instead, we minimize the empirical risk — the average loss over the training data:
\[ \hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x^{(i)}), y^{(i)}) \]
This is computable — and what training algorithms actually minimize.
Empirical Risk Minimization (ERM): \[ \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}(h) \]
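Here is a minimal ERM sketch under assumed conditions: a finite hypothesis class of slopes \(h_w(x) = w\,x\), squared loss, and synthetic training data. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 6, 40)
y = 8 * x + rng.normal(0, 5, size=x.shape)      # synthetic training data

# A small, finite hypothesis class: h_w(x) = w * x for a grid of slopes w.
hypothesis_class = np.linspace(0, 15, 151)

def empirical_risk(w):
    """Average squared loss of h_w on the training set."""
    return np.mean((w * x - y) ** 2)

risks = np.array([empirical_risk(w) for w in hypothesis_class])
w_hat = hypothesis_class[np.argmin(risks)]      # ERM: minimize empirical risk over the class
print(f"ERM pick: h(x) = {w_hat:.1f} * x   (empirical risk {risks.min():.2f})")
```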
Good performance on training data (\(\hat{R}(h)\)) does not guarantee good performance on test data (\(R(h)\)).
The gap between \(R(h)\) and \(\hat{R}(h)\) is called the generalization gap.
Goal: design learning algorithms and hypothesis classes so that this gap is small.
A good model should not just perform well on training data, but also on unseen data.
Generalization = small gap between the empirical risk \(\hat{R}(h)\) on training data and the expected risk \(R(h)\) on unseen data.
Generalization is what separates learning from memorization.
Let \(h\) be a fixed hypothesis and assume:

- Loss \(\ell(h(x), y) \in [0, 1]\)
- \((x^{(i)}, y^{(i)})\) are drawn i.i.d. from distribution \(p(x, y)\)
Then, Hoeffding’s inequality says:
\[ \Pr\left( \left| R(h) - \hat{R}(h) \right| \geq \gamma \right) \leq 2 \exp(-2n\gamma^2) \]
But what if we choose \(h\) after seeing the data? → Let’s talk about uniform convergence next.
To guarantee generalization for all hypotheses in a finite class \(\mathcal{H}\), we use the union bound:
\[ \Pr\left( \exists h \in \mathcal{H}: |R(h) - \hat{R}(h)| \geq \gamma \right) \leq 2 |\mathcal{H}| \cdot \exp(-2n\gamma^2) \]
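Solving \(2|\mathcal{H}| \exp(-2n\gamma^2) \le \delta\) for \(n\) gives \(n \ge \frac{1}{2\gamma^2}\ln\frac{2|\mathcal{H}|}{\delta}\). A small sketch (the numbers are chosen arbitrarily) of how many samples that requires:

```python
import math

def samples_needed(gamma, delta, hypothesis_count):
    """Smallest n with 2*|H|*exp(-2*n*gamma^2) <= delta (Hoeffding + union bound)."""
    return math.ceil(math.log(2 * hypothesis_count / delta) / (2 * gamma ** 2))

# Example: keep |R(h) - R_hat(h)| below 0.05 for every h in a class of 1,000
# hypotheses, with failure probability at most 0.01.
print(samples_needed(gamma=0.05, delta=0.01, hypothesis_count=1000))   # ~2442
```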
Consider a function \(h(\cdot)\) that simply memorizes the training data \(D\):
\[ h(x) = \begin{cases} y_i, & \text{if } \exists (x_i, y_i) \in D \text{ such that } x = x_i \\ 0, & \text{otherwise} \end{cases} \]
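A sketch of this memorizer (synthetic data, illustrative only) makes the failure mode explicit: zero empirical risk on the training set, useless predictions everywhere else.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 6, 20)
y_train = 8 * x_train + rng.normal(0, 5, size=x_train.shape)   # synthetic data, y ~ 8x

memory = dict(zip(x_train.tolist(), y_train.tolist()))

def h_memorize(x):
    """Return the stored label if x appeared in training, otherwise 0."""
    return memory.get(float(x), 0.0)

# Zero empirical risk: every training point is reproduced exactly ...
train_mse = np.mean([(h_memorize(xi) - yi) ** 2 for xi, yi in zip(x_train, y_train)])
print("training MSE:", train_mse)                   # 0.0
# ... but predictions at unseen points fall back to 0, far from the trend (~24 at x = 3).
print("prediction at unseen x = 3.0:", h_memorize(3.0))
```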
We can break the expected squared error at a point \(x\) into three components:
\[ \begin{split} \mathbb{E}_{\mathcal{D},\,y}[(h_{\mathcal{D}}(x) - y)^2] &= \underbrace{(\mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)] - f(x))^2}_{\text{Bias}^2} \\ &+ \underbrace{\mathbb{E}_{\mathcal{D}}\left[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)])^2\right]}_{\text{Variance}} \\ &+ \underbrace{\mathbb{E}_{y}[(y - f(x))^2]}_{\text{Noise}} \end{split} \]
Bias: Error from wrong assumptions (e.g., model too simple)
Variance: Error from sensitivity to training data
Noise: Irreducible randomness in the data-generating process
We want to find the sweet spot between bias and variance.
This is why regularization, cross-validation, and careful model selection matter!
Image source: https://scott.fortmann-roe.com/docs/BiasVariance.html
Image source: Bias-variance tradeoff (Wikipedia)
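The decomposition can also be estimated empirically: repeatedly redraw a training set \(\mathcal{D}\), refit, and look at the spread of \(h_{\mathcal{D}}(x)\) at a fixed point. The sketch below (synthetic target, arbitrary degrees and sample sizes, NumPy assumed) does exactly that for a simple and a rich polynomial class.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(2 * np.pi * x)

x0 = 0.3                                     # fixed query point
n_datasets, n_points, noise_sd = 500, 20, 0.3

for degree in (1, 9):                        # a simple and a rich polynomial class
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)      # fresh training set D
        y = f(x) + rng.normal(0, noise_sd, n_points)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0)) # h_D(x0)
    preds = np.array(preds)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The simple class shows a large squared bias with small variance; the rich class the reverse, mirroring the trade-off above.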
Instead of minimizing empirical risk alone: \[ \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(x^{(i)}), y^{(i)}) \]
We minimize the regularized objective: \[ \hat{R}_{\text{reg}}(h) = \hat{R}(h) + \lambda \cdot \Omega(h) \]
Controls model complexity without needing to change the hypothesis class
Helps prevent overfitting (high variance)
May introduce bias, but reduces variance — often leads to better generalization
Think of it as “tugging the model toward simplicity”
L2 regularization (Ridge): \[ \Omega(h) = \|w\|_2^2 \quad \text{→ encourages small weights} \]
L1 regularization (Lasso): \[ \Omega(h) = \|w\|_1 \quad \text{→ encourages sparsity (some weights = 0)} \]
Can apply to any model with learnable weights, e.g., linear or logistic regression (a ridge-regression sketch follows below)
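As a sketch of the regularized objective in the linear case (synthetic data; the closed-form ridge solution shown assumes the objective \(\frac{1}{n}\|Xw - y\|_2^2 + \lambda\|w\|_2^2\)):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                    # only three informative features
y = X @ w_true + rng.normal(0, 0.5, size=n)

lam = 1.0
# Minimizing (1/n)*||Xw - y||^2 + lam*||w||_2^2 has the closed form below.
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]     # unregularized least squares

print("||w_ols||_2   =", round(float(np.linalg.norm(w_ols)), 3))
print("||w_ridge||_2 =", round(float(np.linalg.norm(w_ridge)), 3))   # shrunk toward zero
```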
| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty term | \(\lambda \|w\|_1 = \lambda \sum_i \lvert w_i \rvert\) | \(\lambda \|w\|_2^2 = \lambda \sum_i w_i^2\) |
| Geometry | Diamond-shaped contours | Circular contours |
| Solution behavior | Sparse: some \(w_i = 0\) | Shrinks all weights smoothly |
| Feature selection | Yes — performs automatic selection | No — retains all features |
| Optimization | Non-differentiable at 0 (but convex) | Differentiable and convex |
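The sparsity difference in the table can be checked with scikit-learn's Lasso and Ridge estimators; this sketch uses synthetic data with only three truly informative features, and the specific alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
n, d = 100, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                     # sparse ground truth
y = X @ w_true + rng.normal(0, 0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)                # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)                # L2 penalty

print("Lasso coefficients:", np.round(lasso.coef_, 2))   # typically several exact zeros
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # small but nonzero everywhere
```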