CISC 484/684 – Introduction to Machine Learning
June 17, 2025
Appears in Homework — please review!
Understand the error decomposition:
\[ \mathbb{E}[(\hat{y} - y)^2] = \text{Bias}^2 + \text{Variance} + \text{Noise} \]
No derivation required, but be sure you know why the cross-term is zero.
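One way to see it, as a sketch: write \(y = f(x) + \varepsilon\) with \(\mathbb{E}[\varepsilon] = 0\) and \(\varepsilon\) independent of the prediction \(\hat{y}\). Then, for example,
\[ \mathbb{E}\big[(\hat{y} - f(x))\,\varepsilon\big] = \mathbb{E}[\hat{y} - f(x)]\;\mathbb{E}[\varepsilon] = 0, \]
and \(\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])(\mathbb{E}[\hat{y}] - f(x))\big] = 0\) because \(\mathbb{E}[\hat{y}] - f(x)\) is a constant at a fixed \(x\).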
We want to predict exam score from hours studied:
Find a rule of the form:
\(\hat{y} = w x + b\)
Let’s eyeball a possible line…
We guessed:
\(\hat{y} = 6x + 35\)
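A minimal sketch of what this guessed line predicts (the study-hour values below are illustrative, not from the lecture data):

```python
# Predictions from the eyeballed line y_hat = 6x + 35.
w, b = 6, 35

for hours in [2, 5, 10]:  # illustrative study times
    y_hat = w * hours + b
    print(f"{hours} hours studied -> predicted score {y_hat}")
```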
We want to predict exam score based on hours studied.
Let’s try a linear model:
\[ \hat{y} = wx + b \]
This is the linear regression model — our job is to find the best values of \(w\) and \(b\) from data.
We assume a linear relationship between input \(x\) and output \(y\):
\[ \hat{y} = wx + b \]
In this model, \(x\) is the input (hours studied), \(w\) is the slope (weight), \(b\) is the intercept (bias), and \(\hat{y}\) is the predicted exam score.
Also known as least squares regression.
More generally, for multidimensional input \(\mathbf{x} \in \mathbb{R}^d\):
\[ \hat{y} = \mathbf{w}^\top \mathbf{x} + b \]
Where \(\mathbf{x} \in \mathbb{R}^d\) is the feature vector, \(\mathbf{w} \in \mathbb{R}^d\) is the weight vector, and \(b \in \mathbb{R}\) is the bias (intercept).
We aim to learn \(w\) and \(b\) from data such that \(\hat{y}\) closely approximates \(y\).
Our model predicts \(\hat{y}\), but real \(y\) may differ.
Example:
| Hours (\(x\)) | True (\(y\)) | Predicted (\(\hat{y}\)) | Error (\(\hat{y} - y\)) |
|---|---|---|---|
| 4 | 55 | 60 | \(+5\) |
| 6 | 70 | 71 | \(+1\) |
| 8 | 85 | 83 | \(-2\) |
Let’s compare two candidate lines:
Both are straight. But are they equally good?
Which line better explains the observed data?
Both lines are straight.
Both make some predictions.
But one line’s predictions are consistently closer to the actual data.
We need a way to quantify the total error between predicted and observed values.
That’s where loss functions come in.
\[ \ell_{\text{sq}}(\hat{y}, y) = (\hat{y} - y)^2 \]
Given \(n\) data points \((x_i, y_i)\):
\[ \text{MSE}(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2 \]
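A minimal sketch of scoring candidate lines with MSE, using the small table above plus one made-up alternative line for comparison:

```python
import numpy as np

# Tiny dataset from the table above: hours studied vs. true exam score.
x = np.array([4.0, 6.0, 8.0])
y = np.array([55.0, 70.0, 85.0])

def mse(w, b):
    """Mean squared error of the line y_hat = w*x + b on (x, y)."""
    y_hat = w * x + b
    return np.mean((y_hat - y) ** 2)

print(mse(6.0, 35.0))   # the eyeballed line from earlier: small error
print(mse(2.0, 50.0))   # a made-up, shallower line: much larger error
```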
We’ll soon see: this comes from a probabilistic assumption.
So far, we’ve assumed a deterministic prediction:
\[ \hat{y} = wx + b \]
But in reality, outputs contain random noise — exam scores can vary even with the same study hours.
We model this uncertainty as:
\[ y = \hat{y} + \varepsilon = wx + b + \varepsilon \]
Assume:
\[ \varepsilon \sim \mathcal{N}(0, \sigma^2) \]
Then the output \(y\) is normally distributed around the predicted value:
\[ y \sim \mathcal{N}(wx + b, \sigma^2) \]
A Gaussian (normal) distribution looks like a bell curve:
\[ \mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right) \]
Many real-world quantities (e.g., measurement noise, aggregated test scores) are approximately Gaussian, which motivates this noise model.
[Figure: bell curve of a Gaussian distribution. Image source: Wikipedia]
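A minimal numeric sketch of evaluating this density (the mean and variance below are made up):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Scores around a predicted mean of 75 with variance 25 (sigma = 5).
print(gaussian_pdf(75.0, 75.0, 25.0))  # peak of the bell curve, about 0.080
print(gaussian_pdf(85.0, 75.0, 25.0))  # two standard deviations out, about 0.011
```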
Given a dataset \(\{(x_i, y_i)\}_{i=1}^n\), the likelihood of the parameters \((w, b)\) is:
\[ \mathcal{L}(w, b) = \prod_{i=1}^n P(y_i \mid x_i; w, b) \]
Each \(y_i\) is assumed drawn from:
\[ y_i \sim \mathcal{N}(wx_i + b, \sigma^2) \]
So:
\[ P(y_i \mid x_i; w, b) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_i - wx_i - b)^2}{2\sigma^2} \right) \]
Taking logs:
\[ \log \mathcal{L}(w, b) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - wx_i - b)^2 + \text{const} \]
Maximizing log-likelihood is the same as:
\[ \arg\max_{w,b} \log \mathcal{L}(w, b) \quad \Longleftrightarrow \quad \arg\min_{w,b} \sum_{i=1}^n (y_i - wx_i - b)^2 \]
This is mean squared error (MSE) minimization.
So, when we assume Gaussian noise, linear regression with MSE loss is performing maximum likelihood estimation (MLE)!
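A small numerical sketch of this equivalence on synthetic data (the line and noise level below are made up; any fixed \(\sigma^2\) gives the same winner):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 6 * x + 35 + rng.normal(0, 5, size=50)   # synthetic scores: a true line plus Gaussian noise

def sse(w, b):
    """Sum of squared errors of the line w*x + b."""
    return np.sum((y - (w * x + b)) ** 2)

def log_likelihood(w, b, sigma2=25.0):
    """Gaussian log-likelihood of (w, b) for a fixed noise variance."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - sse(w, b) / (2 * sigma2)

# Over a grid of candidate (w, b) pairs, the pair with the smallest squared
# error is exactly the pair with the largest log-likelihood.
candidates = [(w, b) for w in np.linspace(4, 8, 41) for b in np.linspace(25, 45, 41)]
best_by_sse = min(candidates, key=lambda p: sse(*p))
best_by_mle = max(candidates, key=lambda p: log_likelihood(*p))
print(best_by_sse == best_by_mle)  # True
```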
We now move to multivariate linear regression:
\[ \hat{y} = \mathbf{w}^\top \mathbf{x} + b \]
To simplify notation, we fold the bias into the weights:
\[ \tilde{\mathbf{x}} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \\ 1 \end{bmatrix} \quad \tilde{\mathbf{w}} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \\ b \end{bmatrix} \]
Then:
\[ \hat{y} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}} \]
Now suppose we have \(n\) examples.
Let \(X \in \mathbb{R}^{n \times (d+1)}\) be the design matrix whose \(i\)-th row is \(\tilde{\mathbf{x}}_i^\top\), and let \(\mathbf{y} = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n\) be the vector of targets.
Computational Efficiency: Matrix operations can be highly optimized by modern CPUs and GPUs, taking advantage of vectorized operations and parallel processing.
Simplicity: A matrix formulation makes your code closely match the math, which keeps it readable and easy to check.
Scalability: The same matrix expression works unchanged as the number of examples and features grows, which makes implementations easier to scale (see the sketch below).
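A minimal sketch of the vectorization payoff, assuming \(X\) stacks the augmented inputs \(\tilde{\mathbf{x}}_i^\top\) as rows (shapes and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])  # each row is x_tilde^T
w_tilde = rng.normal(size=d + 1)

# Loop version: one dot product per example.
y_hat_loop = np.array([X[i] @ w_tilde for i in range(n)])

# Vectorized version: a single matrix-vector product, same result but much faster.
y_hat_vec = X @ w_tilde

print(np.allclose(y_hat_loop, y_hat_vec))  # True
```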
Our goal is to minimize the squared error (written here with a factor of \(\tfrac{1}{2}\) for convenience; scaling does not change the minimizer):
\[ \mathcal{L}(\tilde{\mathbf{w}}) = \frac{1}{2}\| X \tilde{\mathbf{w}} - \mathbf{y} \|^2 = \frac{1}{2}(X \tilde{\mathbf{w}} - \mathbf{y})^\top (X \tilde{\mathbf{w}} - \mathbf{y}) \]
This is a convex quadratic function of \(\tilde{\mathbf{w}}\).
To find the optimal \(\tilde{\mathbf{w}}\), take the gradient:
\[ \nabla_{\tilde{\mathbf{w}}} \mathcal{L} = X^\top (X \tilde{\mathbf{w}} - \mathbf{y}) \]
Set the gradient to zero:
\[ X^\top X \tilde{\mathbf{w}} = X^\top \mathbf{y} \]
From:
\[ X^\top X \tilde{\mathbf{w}} = X^\top \mathbf{y} \]
Assuming \(X^\top X\) is invertible:
\[ \tilde{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y} \]
This is the maximum likelihood estimator under Gaussian noise,
and also the solution to the least squares problem.
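A minimal sketch of computing this solution on synthetic data (solving the linear system \(X^\top X \tilde{\mathbf{w}} = X^\top \mathbf{y}\) is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y = X_raw @ true_w + true_b + rng.normal(0, 0.1, size=n)

# Append a column of ones so the bias is folded into the weights.
X = np.hstack([X_raw, np.ones((n, 1))])

# Normal equations: solve X^T X w = X^T y.
w_tilde = np.linalg.solve(X.T @ X, X.T @ y)
print(w_tilde)  # approximately [2, -1, 0.5, 4]
```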
Add an \(\ell_2\) penalty to the loss:
\[ \mathcal{L}_{\text{ridge}}(\mathbf{w}) = \sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_2^2 \]
\[ \mathbf{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y} \]
The matrix \(X^\top X + \lambda I\) is always invertible when \(\lambda > 0\), since it is positive definite.
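A minimal sketch of the ridge solution, assuming the bias is folded into the weights as before (so the bias column is penalized too, a simplification; libraries usually leave the bias unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny illustrative example: hours studied plus a ones column for the bias.
X = np.array([[4.0, 1.0], [6.0, 1.0], [8.0, 1.0]])
y = np.array([55.0, 70.0, 85.0])
print(ridge_fit(X, y, lam=0.0))   # lam = 0 recovers ordinary least squares: [7.5, 25]
print(ridge_fit(X, y, lam=10.0))  # lam > 0 shrinks the l2 norm of the solution
```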
Use \(\ell_1\) penalty instead:
\[ \mathcal{L}_{\text{lasso}}(\mathbf{w}) = \sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_1 \]
| Method | Penalty | Effect |
|---|---|---|
| OLS | None | Minimum training error |
| Ridge | \(\ell_2\) | Shrinks all weights |
| Lasso | \(\ell_1\) | Shrinks and sparsifies |
[Figure. Image source: “Statistics, Data Mining, and Machine Learning in Astronomy” (2013)]
We’ll compare three fits on the Hours Studied → Exam Score data: OLS, Ridge, and Lasso.
Notice how Ridge shrinks the weights smoothly, while Lasso can set some weights exactly to zero.
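A hedged sketch of such a comparison using scikit-learn's standard `LinearRegression`, `Ridge`, and `Lasso` estimators (the hours/score values below are made up, not the lecture's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Made-up Hours Studied -> Exam Score data.
X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0], [8.0], [9.0]])
y = np.array([42.0, 50.0, 58.0, 66.0, 71.0, 83.0, 88.0])

# Ridge and Lasso shrink the slope relative to OLS for alpha > 0.
for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=10.0)),
                    ("Lasso", Lasso(alpha=5.0))]:
    model.fit(X, y)
    print(f"{name:5s}  w = {model.coef_[0]:.2f}  b = {model.intercept_:.2f}")
```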