Optimization Basics

CISC 484/684 – Introduction to Machine Learning

Sihan Wei

June 24, 2025

Measuring Size: \(L_p\) Norms

  • A norm is a way to measure the size or length of a vector
  • The most common family of norms is the \(L_p\) family:

\[ \|\mathbf{x}\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p} \]

  • For \(p \geq 1\) this defines a valid norm, and its behavior depends on \(p\).

Examples of \(L_p\) Norms

  • \(L_1\) norm (Manhattan norm): \(\|\mathbf{x}\|_1 = \sum_i |x_i|\)
    • “City block” distance
  • \(L_2\) norm (Euclidean norm): \(\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}\)
    • Straight-line distance
  • \(L_\infty\) norm (Max norm): \(\|\mathbf{x}\|_\infty = \max_i |x_i|\)
    • Just the largest coordinate in absolute value! (All three norms are computed in the sketch below.)
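
A minimal NumPy sketch of the three norms (the example vector is arbitrary; np.linalg.norm computes the same quantities):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.sum(np.abs(x))        # L1: 3 + 4 + 1 = 8
l2 = np.sqrt(np.sum(x ** 2))  # L2: sqrt(9 + 16 + 1) = sqrt(26)
linf = np.max(np.abs(x))      # L-infinity: 4

# Cross-check against NumPy's built-in norm:
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
```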

Why Do We Care?

  • Norms are used to:
    • Measure distance between vectors
    • Penalize model complexity (e.g., in regularization)
    • Define constraints in optimization
  • In ML, you’ll often see:
    • \(\|\mathbf{w}\|_2^2\) (Ridge regularization)
    • \(\|\mathbf{w}\|_1\) (Lasso sparsity)
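
As a quick illustration, each penalty is one line of NumPy (the weight vector and the hyperparameter name lam are placeholders):

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])  # a hypothetical weight vector
lam = 0.1                            # regularization strength (illustrative)

ridge_penalty = np.sum(w ** 2)       # ||w||_2^2: shrinks all weights
lasso_penalty = np.sum(np.abs(w))    # ||w||_1: encourages exact zeros

# A regularized objective adds the penalty to the data loss:
# total_loss = data_loss + lam * ridge_penalty   (or lasso_penalty)
```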

Summary

  • The \(L_p\) norm generalizes “length” for different values of \(p\)
  • \(L_2\) is most common (Euclidean)
  • \(L_1\) is useful when we want sparsity
  • Norms will appear everywhere: loss functions, constraints, gradients

Recall

For a linear regression problem, we have the closed-form solution:

\[\mathbf{w}=\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}\]

This is awesome, but:

  • What if we only see one \((x_i, y_i)\) at a time?
  • What if we have too much data to compute \((\mathbf{X}^{\top} \mathbf{X})^{-1}\)?
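
For reference, a minimal NumPy sketch of the closed-form solution on synthetic data (the true weights and noise level are made-up illustrative values; solving the normal equations with np.linalg.solve avoids forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Solve the normal equations X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
```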

Convexity

  • “The minimum is where the gradient is equal to 0”
    • Is that true?

Convex Sets

A set \(S\) is convex if the line segment between any two points of \(S\) lies in \(S\). That is, if for any \(x, y \in S\) and any \(\lambda \in[0,1]\) we have

\[ \lambda x+(1-\lambda) y \in S \]

(Figure: the left set is convex; the middle and right sets are non-convex.)

Convex Functions

  • A function \(f: R^d \rightarrow R\) is convex if (i) \(\operatorname{dom}(f)\) is a convex set and (ii) \(\forall x, y \in \operatorname{dom}(f)\) and \(\lambda \in[0,1]\) we have

\[ f(\lambda x+(1-\lambda) y) \leq \lambda f(x)+(1-\lambda) f(y) \]

  • More intuitively, the line segment between any two points on the graph lies on or above the graph itself! (This is spot-checked numerically in the sketch below.)

  • Benefits of convex functions: Every local minimum is a global minimum!
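
The defining inequality is easy to spot-check numerically; a minimal sketch for \(f(x)=x^2\) at random points (a sanity check, not a proof):

```python
import numpy as np

def f(x):
    return x ** 2  # a convex function

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=2)  # two random points
    lam = rng.uniform()        # lambda in [0, 1)
    # Convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12
print("no violations found")
```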

Convex Functions

First-order condition

  • Suppose that \(\operatorname{dom}(f)\) is open and that \(f\) is differentiable. Then \(f\) is convex if and only if \(\operatorname{dom}(f)\) is convex and for all \(x, y \in \operatorname{dom}(f)\) the following holds:

\[ f(y) \geq f(x)+\nabla f(x)^{\top}(y-x) \]

  • Remark: the graph of the affine function \(y \mapsto f(x)+\nabla f(x)^{\top}(y-x)\) is the tangent hyperplane to the graph of \(f\) at \((x, f(x))\).
  • That is, the graph of a convex function \(f\) lies on or above all of its tangent hyperplanes (spot-checked in the sketch below).
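
The first-order condition can be spot-checked the same way; a sketch for \(f(x)=e^x\):

```python
import numpy as np

def f(x):
    return np.exp(x)  # a convex function

def grad_f(x):
    return np.exp(x)  # its derivative

rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.normal(size=2)
    # First-order condition: f(y) >= f(x) + f'(x) * (y - x)
    assert f(y) >= f(x) + grad_f(x) * (y - x) - 1e-12
print("no violations found")
```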


Examples of Convex Functions

  • Linear function: \(f(x)=a^{\top} x\)
  • Affine function: \(f(x)=a^{\top} x+b\)
  • Exponential function: \(f(x)=e^{a x}\)
  • Quadratic with \(a>0\): \(f(x)=a x^2\)
  • Norms on \(R^d\), e.g. \(f(x)=\|x\|_2\)

Loss functions

0-1 loss is ideal but hard to optimize!

  • Non-convexity: there may be many local minima, making it difficult for an optimization algorithm to find the best solution (the global minimum).
  • Discontinuity: a small change in the model parameters can cause a sudden jump in the loss value, making gradient-based algorithms impossible to use.
  • No useful gradient information: the gradient is either 0 or undefined, so optimization algorithms cannot make informed updates.

Convex Surrogate Losses

Instead, we minimize a convex surrogate: a convex function of the margin \(z=y\,\mathbf{w}^{\top} \mathbf{x}\) that stands in for the 0-1 loss (compared in the sketch after this list).

  • Hinge loss (in SVM): \(\ell_{\text{hin}}=\max\{0,1-z\}\)
  • Logistic loss (in logistic regression): \(\ell_{\text{log}}=\log(1+\exp(-z))\)
  • Exponential loss (in AdaBoost): \(\ell_{\text{exp}}=\exp(-z)\)
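
A minimal sketch of these losses as functions of the margin, with the 0-1 loss included for contrast (the function names are my own labels):

```python
import numpy as np

def zero_one(z):
    return (z <= 0).astype(float)    # 0-1 loss, counting z = 0 as an error

def hinge(z):
    return np.maximum(0.0, 1.0 - z)  # SVM hinge loss

def logistic(z):
    return np.log1p(np.exp(-z))      # log(1 + exp(-z))

def exponential(z):
    return np.exp(-z)                # AdaBoost exponential loss

z = np.linspace(-2.0, 2.0, 5)        # a few margin values
for loss in (zero_one, hinge, logistic, exponential):
    print(f"{loss.__name__:12s}", loss(z))
```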

Gradient Descent: What Are We Trying to Do?

  • We want to minimize a loss function.
  • Imagine we’re trying to find the lowest point in a valley.

Going Downhill

  • Think of the loss surface as a hilly terrain.

  • You’re standing somewhere on the surface.

  • How do you move downhill toward the minimum?

  • You need two things:

    • A direction (where to go)
    • A step size (how far to go)
  • In machine learning, this is gradient descent!

What is the Gradient?

  • The gradient tells you the direction of steepest ascent
    ⟶ So to minimize, we go in the opposite direction

  • Mathematically: \[ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t) \]

  • Where:

    • \(\nabla L(\mathbf{w}_t)\) is the gradient of the loss
    • \(\eta\) is the step size (aka learning rate)

Gradient Based Optimization

Our goal is to minimize the objective function

\[ \min _w f(w)=\frac{1}{n} \sum_{i=1}^n f_i(w) \]

Batch optimization

\[ w^{t+1}=w^t-\gamma_t \nabla f\left(w^t\right)=w^t-\gamma_t \frac{1}{n} \sum_{i=1}^n \nabla f_i\left(w^t\right) \]

Combine all per-example gradients into a single update

Gradient Based Optimization

  • Pro
    • Use all available information to update parameters
  • Con
    • Slow: must consider every data point
    • How different is an update with n-1 vs. n examples?

GD for Linear Regression

The squared loss function for the entire dataset is

\[ f(\mathbf{w})=\frac{1}{2 n} \sum_{i=1}^n\left(\mathbf{w}^{\top} \mathbf{x}_i-y_i\right)^2=\frac{1}{2 n}\|\mathbf{X} \mathbf{w}-\mathbf{Y}\|_2^2 \]

The gradient is

\[ \nabla f(\mathbf{w})=\frac{1}{n} \sum_{i=1}^n\left(\mathbf{w}^{\top} \mathbf{x}_i-y_i\right) \mathbf{x}_i=\frac{1}{n} \mathbf{X}^{\top}(\mathbf{X} \mathbf{w}-\mathbf{Y}) \]

At the \(t\)-th iteration, we update

\[ \mathbf{w}^{t+1}=\mathbf{w}^t-\gamma_t \frac{1}{n} \sum_{i=1}^n\left(\left(\mathbf{w}^{t}\right)^{\top} \mathbf{x}_i-y_i\right) \mathbf{x}_i \]

and the matrix version is

\[ \mathbf{w}^{t+1}=\mathbf{w}^t-\gamma_t \frac{1}{n} \mathbf{X}^{\top}\left(\mathbf{X} \mathbf{w}^t-\mathbf{Y}\right) \]
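
A minimal NumPy implementation of the matrix-form update (the synthetic data, constant step size, and iteration count are illustrative choices):

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.1, iters=500):
    """Batch gradient descent on f(w) = (1/2n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n  # (1/n) X^T (Xw - y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)
w = gd_linear_regression(X, y)  # should approach [2.0, -1.0, 0.5]
```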

Gradient Descent

A typical GD algorithm

  • We show pseudocode for gradient descent; a runnable sketch follows the list
  • Initialize \(w\) and a threshold parameter \(\epsilon\) (e.g., \(10^{-3}\))
  • Repeat until convergence
    • Choose learning rate \(\gamma_t\) (could be a constant)
    • Compute the full gradient \(\nabla f(w)\)
    • Update parameter \(w^{t+1}=w^t-\gamma_t \nabla f\left(w^t\right)\)
    • If \(\left\|w^{t+1}-w^t\right\| \leq \epsilon\), exit the loop
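
The pseudocode translates almost line for line into Python; a sketch assuming a constant learning rate (the max_iters cap is an added safeguard not in the pseudocode):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, eps=1e-3, max_iters=10_000):
    """Generic GD: stop once the parameter change is at most eps."""
    w = w0
    for _ in range(max_iters):
        w_next = w - lr * grad(w)  # w^{t+1} = w^t - gamma_t * grad f(w^t)
        if np.linalg.norm(w_next - w) <= eps:
            return w_next
        w = w_next
    return w

# Example: minimize f(w) = ||w||^2 / 2, whose gradient is w itself.
w_star = gradient_descent(grad=lambda w: w, w0=np.ones(3))
```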

Stochastic Gradient Descent

  • Update rule: \(w^{t+1}=w^t-\gamma_t \nabla f_i\left(w^t\right)\)
  • When \(i\) is sampled uniformly at random, \(\nabla f_i(w)\) is an unbiased estimator of the true gradient \(\nabla f(w)\). That is, \(\mathbb{E}\left[\nabla f_i(w)\right]=\nabla f(w)\)
  • Updates parameters after each example
  • Pro
    • Fast: each update considers a single example, so parameters improve faster
  • Con
    • Noisy updates: each update is based on very little information

Stochastic Gradient Descent

Uniform Single-element Sampling SGD

At each iteration, we do the following steps:

  • Uniformly at random, select \(i \in[n]\) (i.e., \(p_i=\frac{1}{n}\))
  • Compute \(\nabla f_i\left(w^t\right)\)
  • Update \(w^{t+1}=w^t-\gamma_t \nabla f_i\left(w^t\right)\)
  • Note there are many other sampling methods!

SGD for Linear Regression

Recall

\[ f(\mathbf{w})=\frac{1}{n} \sum_{i=1}^n f_i(\mathbf{w})=\frac{1}{2 n} \sum_{i=1}^n\left(\mathbf{w}^{\top} \mathbf{x}_i-y_i\right)^2 \]

Hence, the squared loss for a single data sample \((\mathbf{x}_i, y_i)\) is

\[ f_i(\mathbf{w})=\frac{1}{2}\left(\mathbf{w}^{\top} \mathbf{x}_i-y_i\right)^2 \]

The gradient of \(f_i\) is

\[ \nabla f_i(\mathbf{w})=\left(\mathbf{w}^{\top} \mathbf{x}_i-y_i\right) \mathbf{x}_i \]

At the \(t\)-th iteration, we update

\[ \mathbf{w}^{t+1}=\mathbf{w}^t-\gamma_t\left(\left(\mathbf{w}^{t}\right)^{\top} \mathbf{x}_i-y_i\right) \mathbf{x}_i \]
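
Putting the pieces together, a minimal SGD sketch for linear regression with uniform single-element sampling (the hyperparameters are arbitrary):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, iters=5000, seed=0):
    """SGD on the squared loss, sampling one example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)                # uniform sampling: p_i = 1/n
        grad_i = (w @ X[i] - y[i]) * X[i]  # (w^T x_i - y_i) x_i
        w -= lr * grad_i
    return w

# Same synthetic-data recipe as in the batch GD sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)
w = sgd_linear_regression(X, y)
```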