Making a Toy Gradient Descent Implementation

1 Introduction

I recently came across a few of Andrej Karpathy’s video tutorials on Machine Learning and found them immensely fun and educational. I highly appreciate his hands-on approach to teaching the basic concepts of Machine Learning.

Richard Feynman famously said, “What I cannot create, I do not understand.” So here’s my attempt to create a toy implementation of gradient descent, to better understand the core algorithm that powers Deep Learning, after following Karpathy’s video tutorial on micrograd:

https://github.com/hxy9243/toygrad

Even though there’s a plethora of books, blogs, and references that explain the gradient descent algorithm, it’s a totally different experience when you build it yourself from scratch. Along the way I found quite a few knowledge gaps of my own: things I’d taken for granted and never fully understood.

This blog post contains my notes from that experience. Even writing the post itself helped my understanding in many ways.

Training and inference example using toygrad

2 Key Concepts

2.1 Chain Rule

In calculus, the chain rule gives the basic rule for finding the derivative of a composite function.

As we learned in calculus class, the chain rule states:

$$\frac{df(g(x))}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

For example, the derivative of the function $f(x) = \sin(x^2)$ is:

$$\frac{d\sin(x^2)}{dx} = \frac{d\sin(x^2)}{d(x^2)} \cdot \frac{d(x^2)}{dx} = \cos(x^2) \cdot 2x$$
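We can sanity-check this derivative numerically with a quick Python snippet (a throwaway sketch, not part of toygrad):

import math

# finite-difference check of d/dx sin(x^2) at x = 1.5
x, h = 1.5, 1e-6
numeric = (math.sin((x + h) ** 2) - math.sin(x ** 2)) / h
analytic = math.cos(x ** 2) * 2 * x
print(numeric, analytic)  # both approximately -1.88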

With the chain rule, we can derive the autograd algorithm that implements backpropagation on complex compute graphs.

2.2 Backpropagation

Backpropagation, or backward propagation of errors, is the algorithm for finding the derivatives of the loss function (the function that measures the difference between the model’s predictions and the actual output data).

It calculates the gradients backward through the feed-forward network, from the last layer to the first.

There are four steps to the backpropagation algorithm:

  • Forward Pass: the input data is fed through the model (e.g. a deep neural network) to get the prediction output.
  • Loss Computation: the prediction output is compared with the actual data using a function (e.g. Mean Squared Error, or Cross Entropy Loss). This is the loss function that we aim to minimize ($loss = J(w)$).
  • Backward Pass: this is where gradient descent comes in. In this step we need to find the derivative of the loss function with respect to each parameter ($\frac{\partial J}{\partial w_n}$). Instead of deriving a closed-form formula for each derivative, we apply the chain rule to propagate the gradients backward through the computational graph.
  • Update Parameters: after getting the gradient for each parameter, we update all parameters by subtracting the gradient multiplied by a small value that we call the learning rate.

We repeat this four-step process until the loss is close to its minimum.
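In pseudocode, one iteration of this loop might look like the following sketch (schematic only; names like mse and model.parameters() are hypothetical here, not a specific library’s API):

# one schematic iteration of the four-step loop (hypothetical helper names)
pred = model.forward(x)            # 1. forward pass
loss = mse(pred, y)                # 2. loss computation
loss.backward()                    # 3. backward pass: fill in all gradients
for p in model.parameters():       # 4. update parameters
    p.val -= learning_rate * p.grad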

Conceptually, the gradient represents the slope of the loss function at the current parameter values. By subtracting a small step in the direction of the gradient (scaled by the learning rate) from each parameter, we move the loss closer to its minimum, at a speed set by the learning rate.

There are many tricks and optimizations for adjusting the learning rate, but those are outside the scope of this discussion.

2.3 AutoGrad

AutoGrad, or Automatic Differentiation, is the core of the backward step in backpropagation. It computes the gradients of all parameters in reverse, propagating gradients backward through the compute graph.

Here’s an explanation of why the AutoGrad algorithm makes sense:

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#differentiation-in-autograd

Autograd explanation, from Karpathy's tutorial

The AutoGrad algorithm can be derived from the chain rule. Suppose $y$ is computed from $w_3$, which is computed from $w_2$, which is computed from $w_1$, which in turn is computed from $x$. The chain rule then states that:

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial w_1} \cdot \frac{\partial w_1}{\partial x} = \left(\frac{\partial y}{\partial w_2} \cdot \frac{\partial w_2}{\partial w_1}\right) \cdot \frac{\partial w_1}{\partial x} = \left(\left(\frac{\partial y}{\partial w_3} \cdot \frac{\partial w_3}{\partial w_2}\right) \cdot \frac{\partial w_2}{\partial w_1}\right) \cdot \frac{\partial w_1}{\partial x}$$

In this simple case $w_3$ is the result of a computation on $w_2$, i.e. the successor of $w_2$ in the compute graph, so we can derive that:

$$\frac{\partial y}{\partial w_2} = \frac{\partial y}{\partial w_3} \cdot \frac{\partial w_3}{\partial w_2}$$

When we need to find the gradients of all the parameters, we just apply this chain rule backward through the computational graph. The gradient of each intermediate variable, denoted as:

$$\bar{w}_{i} = \frac{\partial{y}}{\partial w_{i}}$$

is the sum of the contributions from all of its successors $w_j$ in the compute graph: $\bar{w}_i = \sum_j \bar{w}_j \cdot \frac{\partial w_j}{\partial w_i}$.

With this chain rule in mind, we can start from the final result of the equation and work backward through the compute graph. The final output gets a seed gradient value of 1 (since $\frac{\partial y}{\partial y} = 1$), and we back-propagate the gradients to all intermediate parameters.

See more at: https://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation
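PyTorch’s autograd (linked above) implements exactly this reverse accumulation. As a quick cross-check of the $\sin(x^2)$ example from earlier (illustrative only, assuming you have torch installed):

import torch

# reverse-mode autodiff reproduces cos(x^2) * 2x
x = torch.tensor(1.5, requires_grad=True)
y = torch.sin(x ** 2)
y.backward()       # seeds dy/dy = 1 and propagates gradients backward
print(x.grad)      # approximately -1.88, matching cos(1.5^2) * 2 * 1.5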

3 ToyGrad Architecture

3.1 Value Engine Design

The core of this autograd engine is the Value class, which expresses both the feed-forward computation and the backward propagation of gradients. These are the considerations in its design:

  • Basic arithmetic: overriding the basic arithmetic operators of a Value.
  • Feed-forward computation: computing the result value of each operator as the compute graph is built.
  • Gradient computation: pairing each feed-forward computation with a corresponding backward computation, defined by the derivative of that operation.

For example, in the case of multiplication:

class Value:
    def __init__(self, val):
        self.val = val
        self.grad = 0
        self.operands = set()
        self._backward = lambda: None

    def __mul__(self, other):
        other = Value(other) if not isinstance(other, Value) else other

        # the new returned value is the computed value of the operator
        v = Value(self.val * other.val)
        v.operands = set((self, other))

        # here we update the gradient of each operand from the gradient
        # of the result value, using d(a*b)/da = b and d(a*b)/db = a
        def _backward():
            self.grad += other.val * v.grad
            other.grad += self.val * v.grad

        v._backward = _backward
        return v

    ...
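Other operators follow the same pattern: compute the forward value, record the operands, and define a _backward closure from the operation’s derivative. Here’s a sketch of how addition and ReLU might look in the same style (my own sketch; the actual repo code may differ in details):

    def __add__(self, other):
        other = Value(other) if not isinstance(other, Value) else other
        v = Value(self.val + other.val)
        v.operands = set((self, other))

        # d(a + b)/da = d(a + b)/db = 1, so the gradient passes through unchanged
        def _backward():
            self.grad += v.grad
            other.grad += v.grad

        v._backward = _backward
        return v

    def relu(self):
        v = Value(self.val if self.val > 0 else 0.0)
        v.operands = set((self,))

        # ReLU only passes the gradient through where the input was positive
        def _backward():
            self.grad += (1.0 if self.val > 0 else 0.0) * v.grad

        v._backward = _backward
        return v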

Calling backward() on the final result of the compute graph first assigns a gradient of 1 to the final value, then computes the gradients of all the values that feed into it. We can do this in two steps:

  • Create a compute order by traversing the compute graph in reverse (a reverse topological sort).
  • Apply each value’s _backward() function in that order, updating the gradients of all parameters.
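A sketch of what backward() might look like, following these two steps (assuming the Value fields defined above; the repo’s actual implementation may differ):

    def backward(self):
        # step 1: build a topological order of the compute graph via DFS,
        # so every value appears after all of its operands
        order, visited = [], set()

        def visit(v):
            if v not in visited:
                visited.add(v)
                for operand in v.operands:
                    visit(operand)
                order.append(v)

        visit(self)

        # step 2: seed the output gradient with 1, then propagate in reverse
        self.grad = 1.0
        for v in reversed(order):
            v._backward()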

3.2 Neural Network Design

With the Value engine designed, we can now chain these value computations to form a feed-forward neural network with dense layers, where all Values in one layer are connected to the next layer’s values.

Example Neural Network, from Wikipedia

We’ll also need code for:

  • Linear Dense Layer: a dense layer connects to all outputs of the previous layer through weights, plus a bias: $y = W \cdot X + b$ (see the sketch after this list).
  • Neural network: compose the layers into the whole network. Each layer is defined as the matrix multiplication of its weights with the previous layer’s outputs, plus a bias. We can then call the forward() and backward() steps on the network after feeding it the training input and output data.
  • Parameter update: at each training iteration, update all the parameters. In a real implementation this is usually done by the optimizer.
  • Training process: glue everything together in the training-step implementation.
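For the Linear layer, a minimal sketch could look like this (hypothetical and loosely based on the Value API above; see the repo for the actual version):

import random

class Linear(Module):
    def __init__(self, in_features, out_features):
        # one weight per (output, input) pair plus one bias per output,
        # initialized with small random values rather than zeros
        self.weights = [
            [Value(random.uniform(-1, 1)) for _ in range(in_features)]
            for _ in range(out_features)
        ]
        self.biases = [Value(random.uniform(-1, 1)) for _ in range(out_features)]

    def __call__(self, X):
        # y_j = sum_i(w_ji * x_i) + b_j for each output neuron j
        return [
            sum((w * x for w, x in zip(row, X)), bias)
            for row, bias in zip(self.weights, self.biases)
        ]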

Here’s an outline of the code that describes the model definition and the training process:

class Model(Module):
    def __init__(self, input_features, output_features):
        super().__init__()

        self.layer1 = Linear(input_features, 8)
        self.layer2 = Linear(8, 4)
        self.output = Linear(4, output_features)

    def forward(self, X):
        X = self.layer1(X)
        X = [xi.relu() for xi in X]
        X = self.layer2(X)
        X = [xi.relu() for xi in X]
        output = self.output(X)
        return output

    def train(self, X, y, epochs=1, learning_rate=1e-5):
        for epoch in range(epochs):
            final_cost = Value(0.0)

            # accumulate the cost over all samples
            for i, xval in enumerate(X):
                out = self.forward(xval)
                cost = ((y[i] - out) ** 2)[0]
                final_cost += cost

            final_cost = final_cost / len(X)

            # backward pass: compute the gradients of all parameters
            final_cost.backward()

            # apply the gradients with the learning rate
            for p in self.parameters():
                p.val -= p.grad * learning_rate

            # zero out the grads for the next iteration
            for p in self.parameters():
                p.grad = 0.0

            loss = final_cost.val

    ...
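With all the pieces in place, a hypothetical end-to-end usage could look like this (made-up toy data; the exact data format expected by the repo may differ):

# hypothetical usage on a tiny toy dataset
model = Model(input_features=2, output_features=1)

X = [[Value(0.0), Value(1.0)],
     [Value(1.0), Value(0.0)]]     # two input samples
y = [[Value(1.0)], [Value(0.0)]]   # their target outputs

model.train(X, y, epochs=100, learning_rate=1e-2)
print(model.forward(X[0])[0].val)  # prediction for the first sample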

If you’ve read this far, I highly recommend the micrograd video tutorial from Andrej Karpathy himself, along with his implementation. He explains it far better than I could.

Also, let me know if you find any problems in this blog post or in my implementation:

https://github.com/hxy9243/toygrad

3.3 Things to Notice

  • Parameters include both the weights and the biases of the neural network.
  • Initialize all the parameters with random values instead of zeros.
  • You need to zero out all the gradients before each iteration of backpropagation. It’s an easy step to forget.

4 References