CISC 484/684 – Introduction to Machine Learning
June 26, 2025
Suppose we’re predicting binary outcomes.
These are binary classification problems: the outcome is either 1 (success) or 0 (failure).
So how do we model this?
\(\Pr(X = 1) = \mu\)
\(\Pr(X = 0) = 1 - \mu\)
So the PMF becomes:
Very compact form: \[ \Pr(x) = \mu^x (1 - \mu)^{1 - x} \]
The Bernoulli distribution models a single binary outcome: \[ x \in \{0, 1\} \]
It has one parameter: \(\mu \in [0, 1]\), the probability of success (\(x = 1\))
The probability mass function (PMF) is: \[ \Pr(X = x) = \mu^x (1 - \mu)^{1 - x}, \quad \text{for } x \in \{0, 1\} \]
This compact form covers both cases: \(x = 1\) gives \(\Pr(x) = \mu\), and \(x = 0\) gives \(\Pr(x) = 1 - \mu\).
This will be useful when we do MLE.
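A minimal Python sketch of this PMF (NumPy assumed; the helper name `bernoulli_pmf` is just illustrative, not a library function):

```python
import numpy as np

def bernoulli_pmf(x, mu):
    """Compact Bernoulli PMF: mu**x * (1 - mu)**(1 - x) for x in {0, 1}."""
    x = np.asarray(x)
    return mu**x * (1.0 - mu)**(1 - x)

mu = 0.7
print(bernoulli_pmf(1, mu))  # Pr(X = 1) = mu = 0.7
print(bernoulli_pmf(0, mu))  # Pr(X = 0) = 1 - mu = 0.3
```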
We use a parameter \(\mu \in [0, 1]\) to express the fairness of a coin.
A perfectly fair coin has \(\mu = 0.5\).
For convenience, we encode \(H\) as 1 and \(T\) as 0
Define a random variable \(X \in \{0, 1\}\)
Then: \[ \Pr(X = 1; \mu) = \mu \]
For simplicity, we write \(p(x)\) or \(p(x; \mu)\) instead of \(\Pr(X = x; \mu)\)
The parameter \(\mu \in [0, 1]\) specifies the bias of the coin
Fair coin: \(\mu = \frac{1}{2}\)
These two are closely related, but not the same.
The difference is what is fixed and what is being varied.
| Concept | Probability | Likelihood |
|---|---|---|
| Fixed | Parameters (e.g., \(\mu\)) | Data (e.g., \(x_1, \dots, x_n\)) |
| Varies | Outcome (data) | Parameters |
| Usage | Sampling, prediction | Inference, estimation (MLE!) |
If I tell you:
“This is a fair coin: \(\mu = 0.5\)”
Then you ask:
“What’s the probability of getting \(HHH\)?”
We compute: \[ \Pr(HHH \mid \mu = 0.5) = (0.5)^3 = 0.125 \]
You know the model. You’re predicting data.
You say:
“I tossed a coin 3 times and got \(HHH\).”
Then you ask:
“What’s the most likely value of \(\mu\) that explains this?”
We evaluate: \[ L(\mu \mid \text{data}) = \mu^3 \]
Probability vs. likelihood, side by side:
Imagine you observe \(x = [1, 1, 1, 0, 1]\)
Probability: “If I assume \(\mu = 0.6\), how likely is this data?”
Likelihood: “Given this data, which \(\mu\) makes it most likely?”
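To make the contrast concrete, here is a small Python sketch of the two views on the same data (the helper name `likelihood` is illustrative, not a library API):

```python
import numpy as np

x = np.array([1, 1, 1, 0, 1])

def likelihood(mu, x):
    """Product of Bernoulli PMFs: prod_i mu**x_i * (1 - mu)**(1 - x_i)."""
    return np.prod(mu**x * (1.0 - mu)**(1 - x))

# Probability view: mu is fixed, and we compute how likely the data are.
print(likelihood(0.6, x))          # Pr(x; mu = 0.6) = 0.6**4 * 0.4 ≈ 0.0518

# Likelihood view: the data are fixed, and mu is varied.
grid = np.linspace(0.01, 0.99, 99)
values = [likelihood(mu, x) for mu in grid]
print(grid[np.argmax(values)])     # ≈ 0.8, i.e. the sample mean 4/5
```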
Likelihood of an i.i.d. sample \(\mathbf{X} = [x_1, \dots, x_n]\): \[ L(\mu) = \Pr(\mathbf{X}; \mu) = \prod_{i=1}^n \mu^{x_i}(1 - \mu)^{1 - x_i} \]
Log-likelihood: \[ \ell(\mu) = \log L(\mu) = \sum_{i=1}^n \left[ x_i \log \mu + (1 - x_i) \log (1 - \mu) \right] \]
Because \(\log\) is monotonic: \[ \arg\max_\mu \, L(\mu) = \arg\max_\mu \, \log L(\mu) \]
Question: Why do we use the log-likelihood?
To find the MLE, we set the derivative to zero:
\[ \begin{aligned} \frac{d}{d\mu} \log L(\mu) &= \frac{1}{\mu} \sum_{i=1}^n x_i - \frac{1}{1 - \mu} \sum_{i=1}^n (1 - x_i) = 0 \\ \Rightarrow \frac{1 - \mu}{\mu} &= \frac{n - \sum x_i}{\sum x_i} \\ \Rightarrow \widehat{\mu}_{\text{ML}} &= \frac{1}{n} \sum_{i=1}^n x_i \end{aligned} \]
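As a sanity check, a numerical sketch (NumPy assumed, names illustrative) that maximizes the log-likelihood over a grid of \(\mu\) values and compares the maximizer to the closed-form answer, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # simulated coin tosses with true mu = 0.3

def log_likelihood(mu, x):
    """sum_i [x_i log(mu) + (1 - x_i) log(1 - mu)]."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

grid = np.linspace(0.001, 0.999, 999)
mu_grid = grid[np.argmax([log_likelihood(mu, x) for mu in grid])]

print(mu_grid)    # numerical maximizer of the log-likelihood
print(x.mean())   # closed-form MLE: the sample mean (should agree closely)
```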
MLE isn’t always reliable with small samples:
If we toss a coin twice and get \(HT\), then \(\widehat{\mu}_{\text{ML}} = 0.5\)
→ Is this enough to say the coin is fair?
If we toss it only once: \(\widehat{\mu}_{\text{ML}}\) is either 0 or 1, an extreme conclusion drawn from a single observation.
We’re not just flipping biased coins anymore.
In ML, we observe pairs \((\mathbf{x}_i, y_i)\): a feature vector \(\mathbf{x}_i\) and a binary label \(y_i \in \{0, 1\}\).
Our goal: learn a model that predicts \[ \boxed{p(y = 1 \mid \mathbf{x})} \]
Why? Because that gives us a probability we can threshold to make a classification, along with a measure of confidence in the prediction.
Before: \(y_i \sim \text{Bernoulli}(p)\) with constant \(p\)
Now: \(y_i \sim \text{Bernoulli}(p_i)\) where \(p_i\) depends on \(\mathbf{x}_i\)
We model: \[ p(y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{w}^\top \mathbf{x}_i) \]
This is the core idea behind logistic regression: a Bernoulli model whose success probability depends on the input through the sigmoid \(\sigma(z) = \frac{1}{1 + e^{-z}}\).
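A minimal Python sketch of this model, assuming the weights \(\mathbf{w}\) are already given (fitting them by maximizing the likelihood is the next step):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """p(y = 1 | x) = sigma(w^T x) for each row x of X."""
    return sigmoid(X @ w)

# Toy example: 3 samples with 2 features each, and some assumed weights.
X = np.array([[ 1.0,  2.0],
              [ 0.5, -1.0],
              [-2.0,  0.3]])
w = np.array([0.8, -0.5])

print(predict_proba(X, w))   # one Bernoulli parameter p_i per input x_i
```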