CISC 484/684 – Introduction to Machine Learning
June 12, 2025
A set of methods that can automatically detect patterns in data.
Use the uncovered patterns to predict future data.
Core components: training data, a hypothesis class, a loss function, and a learning algorithm that minimizes that loss.
Suppose you’re building a system to predict student exam scores based on how many hours they studied. This is a classic example of a supervised learning problem.
We collect training data in the form of:
\[ \{(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \dots,\ (x^{(n)}, y^{(n)})\} \]
Our goal is to learn a function that maps new values of \(x\) (study time) to predicted scores \(y\).
Given data:
\[ \{(x^{(i)}, y^{(i)})\}_{i=1}^n \]
we want to learn a function \(h(x)\) such that:
\[ h(x) \approx y \]
We call the function we learn the hypothesis, \(h\).
The ultimate goal is not to memorize \(y\), but to generalize well to new (unseen) \(\mathbf{x}\).
In supervised learning, we often assume:
\[ y = f(x) + \epsilon \]
where \(f\) is the unknown target function and \(\epsilon\) is zero-mean noise, so that:
\[ y = f(x) + \epsilon \quad \Rightarrow \quad \mathbb{E}[y \mid x] = f(x) \]
The best we can hope for is a hypothesis that approximates the true function: \[ h(x) \approx f(x) \]
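To make this concrete, here is a minimal sketch (not from the lecture; the numbers, the hypothetical "true" scoring function, and the use of NumPy are all illustrative assumptions): we simulate study hours \(x\), generate scores \(y = f(x) + \epsilon\) with Gaussian noise, and fit a linear hypothesis \(h\) by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" relationship: each study hour adds ~8 points on top of a base of 40.
def f(x):
    return 40 + 8 * x

n = 50
x = rng.uniform(0, 6, size=n)           # hours studied
y = f(x) + rng.normal(0, 5, size=n)     # observed scores: y = f(x) + noise

# Fit a linear hypothesis h(x) = w1*x + w0 by least squares.
w1, w0 = np.polyfit(x, y, deg=1)

def h(x_new):
    return w1 * x_new + w0

print(f"learned hypothesis: h(x) = {w1:.2f}*x + {w0:.2f}")
print("predicted score for 4 hours of study:", round(h(4.0), 1))
```

With enough data the learned slope and intercept land close to the hypothetical true relationship, which is exactly the sense in which \(h(x) \approx f(x)\).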
A hypothesis class is the set of functions that a learning algorithm can search over.
It defines the space of possible models we allow during training.
Example (linear regression):
The hypothesis class is all linear functions of the form:
\[
h(x) = w^T x
\]
A rich hypothesis class is highly expressive, but may overfit the training data.
A simple hypothesis class is easier to learn and generalize from, but may underfit the true pattern.
The goal is to choose a class that is expressive enough to capture the underlying pattern, yet simple enough to learn from the data we have and still generalize (see the sketch below).
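A quick way to see this trade-off is to fit polynomial hypothesis classes of increasing degree to the same small noisy sample. This is an illustrative sketch only; the target function, noise level, and degrees are made-up choices, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)   # small noisy sample

x_grid = np.linspace(0, 1, 200)
f_grid = np.sin(2 * np.pi * x_grid)                            # the underlying function

for degree in (1, 3, 9):                                       # simple vs. richer classes
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_mse = np.mean((np.polyval(coeffs, x_grid) - f_grid) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, error vs. true f {true_mse:.3f}")
```

The richest class drives training error lowest while its error against the true function is typically worse than that of a moderate-degree fit.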
A loss function quantifies how “bad” a prediction is — it measures the discrepancy between the predicted output \(h(x)\) and the true label \(y\).
Formally, it’s a function: \[ \ell(h(x), y) \in \mathbb{R}_{\geq 0} \] where smaller values indicate better predictions.
It gives the learning algorithm a goal: minimize the total (or average) loss over the training data.
Without a loss function, we wouldn’t have a way to compare different hypotheses \(h\).
Different tasks (regression, classification) use different loss functions tailored to the nature of the output and the goal.
The 0-1 loss measures whether a prediction is correct or not:
\[ \ell_{0-1}(h(x), y) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } h(x) = y \\ 1 & \text{if } h(x) \neq y \end{cases} \]
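As a small illustration (the label arrays below are made up), the 0-1 loss can be computed elementwise and averaged to get the misclassification rate:

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 0 where the prediction matches the label, 1 where it does not."""
    return (y_pred != y_true).astype(float)

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

losses = zero_one_loss(y_pred, y_true)
print(losses)                        # [0. 1. 0. 1. 0.]
print("error rate:", losses.mean())  # average 0-1 loss = misclassification rate (0.4)
```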
The squared loss measures the squared difference between the prediction and the true label:
\[ \ell_{\text{sq}}(h(x), y) = (h(x) - y)^2 \]
Assumes that errors (noise) are zero-mean and symmetric — often modeled as Gaussian
Leads to the mean squared error (MSE) objective: \[ \frac{1}{n} \sum_{i=1}^n (h(x^{(i)}) - y^{(i)})^2 \]
The optimal predictor under squared loss is the conditional mean: \[ h^*(x) = \mathbb{E}[y \mid x] \]
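A quick numeric sanity check of that last claim (the \(y\) values below are arbitrary): among constant predictions for labels observed at a single \(x\), the one minimizing the average squared loss is the sample mean.

```python
import numpy as np

y = np.array([2.0, 3.0, 3.0, 10.0])      # labels observed at one fixed x (made up)

def avg_sq_loss(c, y):
    """Average squared loss of the constant prediction h(x) = c."""
    return np.mean((c - y) ** 2)

candidates = np.linspace(0, 12, 1201)    # grid of candidate constant predictions
best = candidates[np.argmin([avg_sq_loss(c, y) for c in candidates])]
print("best constant prediction:", round(best, 2))   # 4.5
print("mean of the labels:      ", y.mean())          # 4.5
```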
Once we define a loss function \(\ell(h(x), y)\), we can define the total “badness” of a predictor \(h\).
The expected (generalization) risk is the average loss over the true data distribution \(p(x, y)\):
\[ R(h) = \mathbb{E}_{(x, y) \sim p}[\ell(h(x), y)] \]
But we don’t know the true distribution \(p(x, y)\)!
Instead, we minimize the empirical risk — the average loss over the training data:
\[ \hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x^{(i)}), y^{(i)}) \]
This is computable — and what training algorithms actually minimize.
Empirical Risk Minimization (ERM): \[ \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}(h) \]
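Here is a minimal ERM sketch under assumed conditions: a finite hypothesis class of slopes \(h_w(x) = w\,x\), squared loss, and synthetic training data. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 6, 40)
y = 8 * x + rng.normal(0, 5, size=x.shape)      # synthetic training data

# A small, finite hypothesis class: h_w(x) = w * x for a grid of slopes w.
hypothesis_class = np.linspace(0, 15, 151)

def empirical_risk(w):
    """Average squared loss of h_w on the training set."""
    return np.mean((w * x - y) ** 2)

risks = np.array([empirical_risk(w) for w in hypothesis_class])
w_hat = hypothesis_class[np.argmin(risks)]      # ERM: minimize empirical risk over the class
print(f"ERM pick: h(x) = {w_hat:.1f} * x   (empirical risk {risks.min():.2f})")
```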
Good performance on training data (\(\hat{R}(h)\)) does not guarantee good performance on test data (\(R(h)\)).
The gap between \(R(h)\) and \(\hat{R}(h)\) is called the generalization gap.
Goal: design learning algorithms and hypothesis classes so that this gap is small.
A good model should not just perform well on training data, but also on unseen data.
Generalization = small gap between the empirical risk \(\hat{R}(h)\) on training data and the expected risk \(R(h)\) on unseen data.
Generalization is what separates learning from memorization.
Let \(h\) be a fixed hypothesis and assume:

- Loss \(\ell(h(x), y) \in [0, 1]\)
- \((x^{(i)}, y^{(i)})\) are drawn i.i.d. from distribution \(p(x, y)\)
Then, Hoeffding’s inequality says:
\[ \Pr\left( \left| R(h) - \hat{R}(h) \right| \geq \gamma \right) \leq 2 \exp(-2n\gamma^2) \]
But what if we choose \(h\) after seeing the data? → Let’s talk about uniform convergence next.
To guarantee generalization for all hypotheses in a finite class \(\mathcal{H}\), we use the union bound:
\[ \Pr\left( \exists h \in \mathcal{H}: |R(h) - \hat{R}(h)| \geq \gamma \right) \leq 2 |\mathcal{H}| \cdot \exp(-2n\gamma^2) \]
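Solving \(2|\mathcal{H}| \exp(-2n\gamma^2) \le \delta\) for \(n\) gives \(n \ge \frac{1}{2\gamma^2}\ln\frac{2|\mathcal{H}|}{\delta}\). A small sketch (the numbers are chosen arbitrarily) of how many samples that requires:

```python
import math

def samples_needed(gamma, delta, hypothesis_count):
    """Smallest n with 2*|H|*exp(-2*n*gamma^2) <= delta (Hoeffding + union bound)."""
    return math.ceil(math.log(2 * hypothesis_count / delta) / (2 * gamma ** 2))

# Example: keep |R(h) - R_hat(h)| below 0.05 for every h in a class of 1,000
# hypotheses, with failure probability at most 0.01.
print(samples_needed(gamma=0.05, delta=0.01, hypothesis_count=1000))   # ~2442
```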
Consider a function \(h(\cdot)\) that simply memorizes the training data \(D\):
\[ h(x) = \begin{cases} y_i, & \text{if } \exists (x_i, y_i) \in D \text{ such that } x = x_i \\ 0, & \text{otherwise} \end{cases} \]
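A sketch of this memorizer (synthetic data, illustrative only) makes the failure mode explicit: zero empirical risk on the training set, useless predictions everywhere else.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 6, 20)
y_train = 8 * x_train + rng.normal(0, 5, size=x_train.shape)   # synthetic data, y ~ 8x

memory = dict(zip(x_train.tolist(), y_train.tolist()))

def h_memorize(x):
    """Return the stored label if x appeared in training, otherwise 0."""
    return memory.get(float(x), 0.0)

# Zero empirical risk: every training point is reproduced exactly ...
train_mse = np.mean([(h_memorize(xi) - yi) ** 2 for xi, yi in zip(x_train, y_train)])
print("training MSE:", train_mse)                   # 0.0
# ... but predictions at unseen points fall back to 0, far from the trend (~24 at x = 3).
print("prediction at unseen x = 3.0:", h_memorize(3.0))
```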
We can break the expected squared error at a point \(x\) into three components:
\[ \begin{split} \mathbb{E}_{\mathcal{D},\,y}[(h_{\mathcal{D}}(x) - y)^2] &= \underbrace{(\mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)] - f(x))^2}_{\text{Bias}^2} \\ &+ \underbrace{\mathbb{E}_{\mathcal{D}}\left[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)])^2\right]}_{\text{Variance}} \\ &+ \underbrace{\mathbb{E}_{y}[(y - f(x))^2]}_{\text{Noise}} \end{split} \]
Bias: Error from wrong assumptions (e.g., model too simple)
Variance: Error from sensitivity to training data
Noise: Irreducible randomness in the data-generating process
We want to find the sweet spot between bias and variance.
This is why regularization, cross-validation, and careful model selection matter!
Image source: https://scott.fortmann-roe.com/docs/BiasVariance.html
Image source: Bias-variance tradeoff (Wikipedia)
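The decomposition can also be estimated empirically: repeatedly redraw a training set \(\mathcal{D}\), refit, and look at the spread of \(h_{\mathcal{D}}(x)\) at a fixed point. The sketch below (synthetic target, arbitrary degrees and sample sizes, NumPy assumed) does exactly that for a simple and a rich polynomial class.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(2 * np.pi * x)

x0 = 0.3                                     # fixed query point
n_datasets, n_points, noise_sd = 500, 20, 0.3

for degree in (1, 9):                        # a simple and a rich polynomial class
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)      # fresh training set D
        y = f(x) + rng.normal(0, noise_sd, n_points)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0)) # h_D(x0)
    preds = np.array(preds)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The simple class shows a large squared bias with small variance; the rich class the reverse, mirroring the trade-off above.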
Instead of minimizing empirical risk alone: \[ \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(x^{(i)}), y^{(i)}) \]
We minimize the regularized objective: \[ \hat{R}_{\text{reg}}(h) = \hat{R}(h) + \lambda \cdot \Omega(h) \]
Controls model complexity without needing to change the hypothesis class
Helps prevent overfitting (high variance)
May introduce bias, but reduces variance — often leads to better generalization
Think of it as “tugging the model toward simplicity”
L2 regularization (Ridge): \[ \Omega(h) = \|w\|_2^2 \quad \text{→ encourages small weights} \]
L1 regularization (Lasso): \[ \Omega(h) = \|w\|_1 \quad \text{→ encourages sparsity (some weights = 0)} \]
Can apply to any model with learnable weights, e.g., linear or logistic regression (a ridge-regression sketch follows below)
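As a sketch of the regularized objective in the linear case (synthetic data; the closed-form ridge solution shown assumes the objective \(\frac{1}{n}\|Xw - y\|_2^2 + \lambda\|w\|_2^2\)):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                    # only three informative features
y = X @ w_true + rng.normal(0, 0.5, size=n)

lam = 1.0
# Minimizing (1/n)*||Xw - y||^2 + lam*||w||_2^2 has the closed form below.
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]     # unregularized least squares

print("||w_ols||_2   =", round(float(np.linalg.norm(w_ols)), 3))
print("||w_ridge||_2 =", round(float(np.linalg.norm(w_ridge)), 3))   # shrunk toward zero
```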
| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty term | \(\lambda \|w\|_1 = \lambda \sum_i \lvert w_i \rvert\) | \(\lambda \|w\|_2^2 = \lambda \sum_i w_i^2\) |
| Geometry | Diamond-shaped contours | Circular contours |
| Solution behavior | Sparse: some \(w_i = 0\) | Shrinks all weights smoothly |
| Feature selection | Yes — performs automatic selection | No — retains all features |
| Optimization | Non-differentiable at 0 (but convex) | Differentiable and convex |
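The sparsity difference in the table can be checked with scikit-learn's Lasso and Ridge estimators; this sketch uses synthetic data with only three truly informative features, and the specific alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
n, d = 100, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                     # sparse ground truth
y = X @ w_true + rng.normal(0, 0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)                # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)                # L2 penalty

print("Lasso coefficients:", np.round(lasso.coef_, 2))   # typically several exact zeros
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # small but nonzero everywhere
```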