CISC 484/684 – Introduction to Machine Learning
July 3, 2025
So far
\[ \left\{\left(x_i, y_i\right)\right\}_{i=1}^n \quad \text { where } x_i \in \mathbb{R}^d, y_i \in\{-1,+1\} \]
\[ \hat{y}=\operatorname{sign}(w^\top x) \]
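As a quick illustration (not part of the slides), here is a minimal NumPy sketch of the prediction rule \(\hat{y}=\operatorname{sign}(w^\top x)\); the names `w` and `X` are just example arrays.

```python
import numpy as np

def predict(w, X):
    """Linear classifier: label = sign(w . x) for each row of X."""
    scores = X @ w                       # inner products w^T x_i
    return np.where(scores >= 0, 1, -1)  # map to {-1, +1}

# toy usage: two points in R^2
w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0], [0.0, 2.0]])
print(predict(w, X))  # -> [ 1 -1]
```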
There are many lines that separate the data. Which one is the BEST?
What about classifier (c)?
In the case of a line in the plane given by the equation \(a x+b y+c=0\), where \(a, b\) and \(c\) are real constants with \(a\) and \(b\) not both zero, the distance from the line to a point \(\left(x_0, y_0\right)\) is:
\[ \operatorname{distance}\left(a x+b y+c=0,\left(x_0, y_0\right)\right)=\frac{\left|a x_0+b y_0+c\right|}{\sqrt{a^2+b^2}} . \]
For an arbitrary point \(x_i\), its distance to the decision boundary is
\[ r_i=\frac{\left|\vec{w} \cdot \vec{x}_i+b\right|}{\|\vec{w}\|} . \]
We introduce \(y\):
\[ \begin{aligned} & y=+1 \text { for }+ \text { samples } \\ & y=-1 \text { for }- \text { samples } \\ & \gamma_i=y_i\left(\frac{\vec{w} \cdot \vec{x}_i+b}{\|\vec{w}\|}\right) \end{aligned} \]
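To make the two quantities concrete, here is a small NumPy sketch (an illustration, not from the slides) computing the unsigned distance \(r_i\) and the signed margin \(\gamma_i\); the names `w`, `b`, `X`, `y` are assumed example arrays.

```python
import numpy as np

def distances_and_margins(w, b, X, y):
    """r_i = |w.x_i + b| / ||w||,  gamma_i = y_i (w.x_i + b) / ||w||."""
    scores = X @ w + b
    norm_w = np.linalg.norm(w)
    r = np.abs(scores) / norm_w    # unsigned distance to the hyperplane
    gamma = y * scores / norm_w    # signed margin: positive iff correctly classified
    return r, gamma
```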
Let’s say each sample has distance at least \(\gamma\), i.e. we constrain \(\gamma_i \geqslant \gamma\) for all \(i\).
We want to maximize \(\gamma\); i.e., we want the points closest to the decision boundary to have a large margin.
Objective.
\(\max \gamma\)
s.t. \(\quad \frac{y_i\left(\vec{w} \cdot \vec{x}_i+b\right)}{\|\vec{w}\|} \geqslant \gamma \quad \forall i\)
Objective (rewritten in terms of the functional margin \(\hat{\gamma}=\gamma\|\vec{w}\|\))
\(\max \frac{\hat{\gamma}}{\|\vec{w}\|}\)
s.t. \(y_i\left(\vec{w} \cdot \vec{x}_i+b\right) \geqslant \hat{\gamma} \quad \forall i\)
Since rescaling \((\vec{w}, b)\) does not change the decision boundary, we can fix \(\hat{\gamma}=1\):
\[ \begin{array}{lll} \max \frac{1}{\|\vec{w}\|} & \Leftrightarrow & \min \frac{1}{2}\|\vec{w}\|^2 \\ \text { s.t. } y_i\left(\vec{w} \cdot \vec{x}_i+b\right) \geqslant 1 & \quad & \text{ s.t. } y_i\left(\vec{w} \cdot \vec{x}_i+b\right) \geqslant 1 \end{array} \]
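The right-hand problem is a quadratic program and can be handed to a generic solver. A minimal sketch using the cvxpy package (an assumption here, not a tool prescribed by the course), presuming linearly separable arrays `X`, `y`:

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1 for all i."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```

If the data are not separable, this program is infeasible, which is one motivation for the hinge-loss formulation below.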
\[ \ell_{\text{log}}=\log(1+\exp(-y(w^\top x))) \]
\[ \ell_{\text{hinge}} = \max\{0,1-y(w^\top x)\} \]
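For comparison, both losses are easy to write down as functions of the margin \(m = y\,(w^\top x)\); this is just an illustrative NumPy sketch.

```python
import numpy as np

def log_loss(margins):
    """l_log(m) = log(1 + exp(-m))."""
    return np.log1p(np.exp(-margins))

def hinge_loss(margins):
    """l_hinge(m) = max(0, 1 - m)."""
    return np.maximum(0.0, 1.0 - margins)
```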
\[ \frac{1}{n} \sum_{i=1}^n\left(1-y_i\left[w \cdot x_i\right]\right)^{+}+\lambda \frac{1}{2}\|w\|^2 \]
\[ \frac{1}{n} \sum_{i=1}^n \underbrace{-\log g\left(y_i\left[w \cdot x_i\right]\right)}_{-\log P\left(y_i \mid x_i, w\right)}+\lambda \frac{1}{2}\|w\|^2 \]
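Both regularized objectives above can be minimized with (sub)gradient methods. A minimal sketch of one subgradient step on the hinge-loss (SVM) objective, assuming data arrays `X`, `y`, regularization `lam`, and step size `eta`:

```python
import numpy as np

def svm_subgradient_step(w, X, y, lam, eta):
    """One subgradient step on (1/n) sum_i max(0, 1 - y_i w.x_i) + (lam/2)||w||^2."""
    n = X.shape[0]
    margins = y * (X @ w)
    active = margins < 1                              # samples with nonzero hinge loss
    grad = -(X[active].T @ y[active]) / n + lam * w   # subgradient of the objective
    return w - eta * grad
```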
Problem
\[ \begin{gathered} \min _{\mathbf{w}} f(\mathbf{w}) \\ \text { s.t. } g_i(\mathbf{w}) \leq 0, i=1, \ldots, n \\ h_j(\mathbf{w})=0, j=1, \ldots, l \end{gathered} \]
Define
\[ \mathcal{L}(\mathbf{w}, \alpha, \beta)=f(\mathbf{w})+\sum_{i=1}^n \alpha_i g_i(\mathbf{w})+\sum_{j=1}^l \beta_j h_j(\mathbf{w}) \]
To minimize \(\mathcal{L}\) over \(\mathbf{w}\), set
\[ \frac{\partial \mathcal{L}}{\partial w_i}=0 \]
\[ d^*=\max _{\alpha, \beta: \alpha_i \geq 0} \min _{\mathbf{w}} \mathcal{L}(\mathbf{w}, \alpha, \beta) \leq \min _{\mathbf{w}} \max _{\alpha, \beta: \alpha_i \geq 0} \mathcal{L}(\mathbf{w}, \alpha, \beta)=p^* \]
We first write out the Lagrangian:
\[ L(\mathbf{w}, b, \alpha)=\frac{1}{2}\|\mathbf{w}\|^2-\sum_{i=1}^n \alpha_i\left\{y_i\left(\mathbf{w} \cdot \mathbf{x}_i+b\right)-1\right\} \]
Then take the derivative with respect to \(b\), as it is easy!
\[ \frac{\partial L}{\partial b}=-\sum_{i=1}^n \alpha_i y_i \]
Setting it to 0 does not determine \(b\) (the derivative does not involve \(b\)); instead, it gives a constraint on the multipliers:
\[ 0=\sum_{i=1}^n \alpha_i y_i \]
Then take the derivative with respect to \(\mathbf{w}\):
\[ \frac{\partial L}{\partial w_j}=w_j-\sum_{i=1}^n \alpha_i y_i x_{i, j} \]
\[ \frac{\partial L}{\partial \mathbf{w}}=\mathbf{w}-\sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \]
Setting it to 0 solves for the optimal \(\mathbf{w}\):
\[ \mathbf{w}=\sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \]
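Once the \(\alpha_i\) are known (they come from the dual problem below), recovering \(\mathbf{w}\) is a single weighted sum. A small NumPy sketch, with `alpha`, `X`, `y` assumed; the recovery of \(b\) uses the standard fact (not derived on these slides) that any support vector satisfies \(y_k(\mathbf{w} \cdot \mathbf{x}_k + b)=1\).

```python
import numpy as np

def recover_w(alpha, X, y):
    """w = sum_i alpha_i y_i x_i."""
    return X.T @ (alpha * y)

def recover_b(alpha, X, y, w, tol=1e-6):
    """Pick any support vector (alpha_k > 0): y_k (w.x_k + b) = 1, so b = y_k - w.x_k."""
    k = int(np.argmax(alpha > tol))
    return y[k] - X[k] @ w
```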
Substituting these back into the Lagrangian, and keeping them as constraints, we obtain the dual objective:
\[ \tilde{L}(\alpha)=\sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j \]
subject to
\[ \begin{gathered} \alpha_i \geq 0 \quad \forall i \\ \sum_{i=1}^n \alpha_i y_i=0 \end{gathered} \]
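As with the primal, the dual is a quadratic program. A hedged sketch using cvxpy (again an assumption, not a course requirement): since \(\sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j = \|\sum_i \alpha_i y_i \mathbf{x}_i\|^2\), the quadratic term can be written with a sum of squares.

```python
import numpy as np
import cvxpy as cp

def dual_svm(X, y):
    """Maximize sum_i alpha_i - (1/2)||sum_i alpha_i y_i x_i||^2
    subject to alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    n = X.shape[0]
    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha)
                            - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
    constraints = [alpha >= 0, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```

The optimal \(\mathbf{w}\) and \(b\) can then be recovered from `alpha` as in the sketch above.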