Resources:

Download: Linear Regression Cheat Sheet

GitHub: Linear Regression Model

Gradient Descent Motivation

Machine Learning has 2 phases: Learning and Prediction.

Learning: The learning algorithm job is to create a hypothesis function.
Prediction: The hypothesis functions job is to predict outputs.

The cost function represents the error between predicted output and actual outputs. We can tune parameters of learning algorithm to minimize the error. Gradient descent is a popular way of doing so.

Gradient Descent Algorithm

Goal: Find a local minima of Cost Function, $J\left(\theta\right)$

Start with any value for parameters of learning.
1. Can pick random value or initiate to 0. Doesn’t matter much for linear regression.

Example, $\theta_0=\vec{0}$

parameters = np.zeros(dimensions + 1)

Update the parameters to go down fast (on the surface of the cost function)

$\theta _j:=\theta _j-\alpha\frac{\partial}{\partial x}\left(J\left(\theta\right)\right)$

$\theta _j:=\theta _j+\alpha\left(y^{\left(i\right)}-h_{\theta}\left(x^{\left(i\right)}\right)\right)x_j^{\left(i\right)}$

parameters += learning_rate * (y_i - predict(x_i)) * x_i

Repeat until you get to a local optimum.
1. Strong sign of a local optimum when the cost function is no longer significantly decreasing, given that the learning rate is small.

Gradient Descent Example

Linear regression model with a gradient descent based learning algorithm

"""Machine Learning Models"""

import numpy as np


class LinearRegression:
    """Linear Regression"""

    def __init__(self, dimensions: int = 1):
        """init

        Parameters
        ----------
        dimensions
            number of dimensions for linear regression, by default 1
        """
        # Initilize parameters
        self.parameters = np.zeros(dimensions + 1)

    def learn(self, x: np.ndarray, y: np.ndarray, learning_rate: float, max_iter: int):
        """learn maps input-output training dataset to a hypothesis function

        Parameters
        ----------
        x
            input data
        y
            output data
        learning_rate
            learning rate
        max_iter
            maximum number of iterations, currently equal to total iterations
        """
        # Gradient descent
        for _ in range(max_iter):
            for xy in zip(x, y):
                x_i, y_i = self.reconstruct_input_vec(xy[0]), xy[1]
                self.parameters += learning_rate * (y_i - self.predict(x_i)) * x_i

    def reconstruct_input_vec(self, x: np.ndarray | float):
        """Create a dummy input, x_0, always defined as 1

        Parameters
        ----------
        x
            input

        Returns
        -------
            (x_0,...x_n), x_0 = 1
        """
        return np.concatenate(([1], np.atleast_1d(x)))

    def predict(self, x: np.ndarray):
        """Predict using linear regression hypothesis function

        Parameters
        ----------
        x
            input

        Returns
        -------
            prediction
        """
        return np.dot(self.parameters, x)

Batch Vs Stochastic

Batch gradient descent scans through the entire training set before taking a single step, where a stochastic gradient descent takes a step with each example. Stochastic is often preferred over batch gradient descent when the data set is large.

Gradient Descent