Monday, November 9, 2020

Yet Another Backpropagation Blog Post

At this point, there are way too many backprop posts on the net, including another article titled Yet Another Introduction to Backpropagation. I add another one to the pile as proof to myself that I have understood this reasonably well, and perhaps something in here may help out someone who was stuck with backprop, and wandered onto this blog for help. (So, yes, just the former reason.)

Backpropagation is a more general process; here I cover only backpropagation for a feedforward neural network. Note that the Wikipedia page on backpropagation was the clearest to me, and this post is largely derived from it. This post assumes you have already attempted to grok backprop at least once and covers only the nuances of the operation; it is not meant as a high-level overview.

Prerequisites

  • Matrix differentiation: I recommend this post which is long but worthwhile.
  • Numerator and denominator convention: This is part of matrix differentiation, but especially important. The Wikipedia page covers it well, and also succinctly states an issue with the conventions: authors of both groups (numerator and denominator convention) often write as though their specific convention were standard.
  • High-level overview of how a feed-forward neural network works.
  • Understanding of why the negative of the gradient of a function at a point is the direction of its steepest descent.

Notation

Having sensible notation clears out a lot of issues in the understanding of backprop (or of a lot of other concepts, but more so here because of the oh-so-many vectors, indices, and matrices).

For the rest of this article, consider a neural network with $n+1$ layers, labelled $0$ to $n$. Layer $0$ is the input layer, and layer $n$ is the output layer. So, excluding the input and output layers, our network has $n-1$ layers. We follow 0-based convention just to make it easier while coding this up.

Every layer $i$ is associated with two $l_i$-length vectors (or $l_i \times 1$ matrices) $z_i$ and $a_i$. Let us whip out the feedforward equations to see how they are related (also look at the diagram below):
$$z_{i+1} = W_i \times a_i \\ a_i = \sigma(z_i)$$
A few things to remember:

  • Non-linear function $\sigma$ acts on $z_i$ of a layer and produces $a_i$ (of the same layer, as shown by the index). Although we write it here as acting on a vector and producing a vector, the function (for example, sigmoid) that we talk about in this article acts on a scalar and produces a scalar. When written as shown above, we actually mean:
    $$a_i = \begin{bmatrix} \sigma(z_i^0) \;\;\; \sigma(z_i^1) \;\;\; \dots \;\;\; \sigma(z_i^{l_i - 1})\end{bmatrix}$$
  • $W_i$ is a matrix (the weight matrix) that acts on vector $a_i$ (weight matrices have been labelled based on the vector they act upon) and produces $z_{i+1}$ of the next layer.
  • The dimensionality of matrix $W_i$ is therefore equal to $l_{i+1} \times l_i$.
  • $a_0$ is equal to the input vector, and that is the starting point of our recurrence. $z_0$ is never used.
  • $z_n$ is the output vector, and the end of our recurrence. $a_n$ is never used.
  • Note that we have not yet labelled any elements within each of the layer-vectors or connecting weight matrices.

Below is a schematic diagram for $4$ layers labelled $0$ to $3$. (Here, $n = 3$.)

$$\underbrace{\begin{bmatrix} . \\ . \\ . \end{bmatrix}}_{z_0, a_0: \; l_0 \times 1} \,\,\,\, \underbrace{W_0}_{l_1 \times l_0} \,\,\,\, \underbrace{\begin{bmatrix} . \\ . \\ . \\ . \end{bmatrix}}_{z_1, a_1: \; l_1 \times 1} \,\,\,\, \underbrace{W_1}_{l_2 \times l_1} \,\,\,\, \underbrace{\begin{bmatrix} . \\ . \\ . \\ . \\ . \end{bmatrix}}_{z_2, a_2: \; l_2 \times 1} \,\,\,\, \underbrace{W_2}_{l_3 \times l_2} \,\,\,\, \underbrace{\begin{bmatrix} . \\ . \end{bmatrix}}_{z_3, a_3: \; l_3 \times 1}$$

Eventually, we will also need to refer to the elements within each vector and matrix. Since the subscripts are all used up referring to the vectors and matrices, we will use superscripts for elements. The elements follow the same 0-based indexing as the vectors and matrices. Do not confuse these with powers; we never use powers in this article (except for the error, and there it is too clear to be confused).

For example, $z_0^2$ refers to the third element of $z_0$. $W_2^{13}$ refers to the element in the second row and fourth column of $W_2$. We will never go as far as two-digit indices, so $13$ in the superscript should never be confused with thirteen.

Having the matrix indices as subscripts and element indices as superscripts is probably the opposite of the usual convention, but it is useful: when we refer to matrix transposes, we don’t refer to elements, so this allows us to write $T$ in the superscript and the matrix index in the subscript, as in $W_2^T$.
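To make the notation concrete, here is a minimal forward-pass sketch in NumPy. This is not the code from the repository linked at the end of the post; it is just an illustration of the conventions above, assuming column vectors for $z_i$ and $a_i$ and a sigmoid non-linearity.

```python
import numpy as np

def sigmoid(z):
    # Elementwise non-linearity: acts on each scalar entry of z.
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    """Run the feedforward recurrence.

    weights: [W_0, ..., W_{n-1}], where W_i has shape (l_{i+1}, l_i)
    x:       input column vector of shape (l_0, 1)
    Returns the lists of z_i and a_i vectors.
    """
    zs, activations = [None], [x]        # a_0 = x; z_0 is never used
    for W in weights:
        z = W @ activations[-1]          # z_{i+1} = W_i a_i
        zs.append(z)
        activations.append(sigmoid(z))   # a_{i+1} = sigma(z_{i+1}); a_n is computed but never used
    return zs, activations               # zs[-1] is the output z_n
```

For the 4-layer diagram above (layer sizes 3, 4, 5, 2), `weights` would be a list of three matrices with shapes (4, 3), (5, 4) and (2, 5).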

Setting up the calculations

Let us assume for now that we have a training set with only one data point, that is, we have been given one input vector $(x)$ and one output vector $(y)$. We will later generalise to a training set with multiple data points.

The error $(E)$ is a function that indicates how different the output vector of the neural network $(z_n)$ is from the expected output $(y)$. Here, we define it as $(y - z_n)^2$.

The goal at the end of training is to figure out the right weight matrices $\{W_i\}$ that minimise this error.

The goal at the end of backpropagation is to find the changes that need to be made to each weight matrix to get closer to the right weight matrix.

Because the direction of steepest descent of a function is given by the negative of the gradient, the change $dW_i$ we add to the weights is equal to a multiple of the negative of the gradient of $E$ with respect to $W_i$.
Therefore we have,
$$dW_i = -m \cdot \frac{\partial E}{\partial W_i}$$

$m$ is a hyperparameter called the “learning rate”, which we will worry about later. How does the right-hand side of the above equation look? We have one scalar $E$ whose derivative we take with respect to $W_i$, so the total number of elements in our derivative is equal to the total number of elements in $W_i$.

However, there are two different ways of taking the derivative of a scalar with respect to a matrix, and each leaves you with a different dimensionality.

The numerator convention ensures that $dW_i$ has the same dimensions as $W_i^T$. The denominator convention ensures that $dW_i$ has the same dimensions as $W_i$ (and all the corresponding elements are correct, so we can just add $dW_i$ to $W_i$ to get our new $W$). For this step only, we will use the denominator convention, since it is so convenient.

So, to accomplish backpropagation, it is sufficient to compute $\frac{\partial E}{\partial W_i}$ for every $i$ from $0$ to $n-1$. (Reminder: there are $n$ weight matrices and $n+1$ layers; there is no need for a weight matrix to act on the last layer.)
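Once each $\frac{\partial E}{\partial W_i}$ is available in denominator convention (i.e. shaped exactly like $W_i$), applying the update is one line per matrix. A minimal sketch, assuming NumPy arrays and a hypothetical `grads` list holding the gradients in the same order as `weights`:

```python
def apply_updates(weights, grads, m):
    # dW_i = -m * dE/dW_i, then W_i <- W_i + dW_i
    for i in range(len(weights)):
        weights[i] += -m * grads[i]
```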

Writing down the chain rule

We will write down $\frac{\partial E}{\partial W_i}$ for every matrix in the example with $n=3$ drawn above. Let’s put down the relevant equations here once again:

$$z_{i+1} = W_i \times a_i \\ a_i = \sigma(z_i) \\ E = (y - z_n)^2$$

As an aside, before we start, it helps to remember that the training data is the set of constants that we have, and the weights are the variables that we’re changing. This is easy to forget, especially because we use $x$ and $y$ for the training data.

Let us write down the above equations explicitly for every index.

$$\begin{aligned} E &= (y - z_3)^2 \\ z_3 &= W_2 \times a_2 \\ a_2 &= \sigma(z_2) \\ z_2 &= W_1 \times a_1 \\ a_1 &= \sigma(z_1) \\ z_1 &= W_0 \times a_0 \\ a_0 &= x \end{aligned}$$

We will just write the derivatives by the chain rule (keeping aside dimensionality for now):
$$\begin{aligned} \frac{\partial E}{\partial W_2} &= \frac{\partial E}{\partial z_3} \times \frac{\partial z_3}{\partial W_2}\\ \frac{\partial E}{\partial W_1} &= \frac{\partial E}{\partial z_3} \times \frac{\partial z_3}{\partial a_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial z_2}{\partial W_1} \\ \frac{\partial E}{\partial W_0} &= \frac{\partial E}{\partial z_3} \times \frac{\partial z_3}{\partial a_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial z_2}{\partial a_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_1}{\partial W_0} \\ \end{aligned}$$

There are a few things to notice now:

  • We can see that if there were more layers, the same pattern would follow for all the weight matrices, so whatever we do for the above generalises to more layers.
  • We notice that there is repetition in the terms that are calculated: the first term calculated for the $W_2$ expression is the same as the first term calculated for the $W_1$ expression, and the first three terms calculated for the $W_1$ expression are the same as the first three terms for the $W_0$ expression. This repetition would continue were we to add more layers.
  • The LHS of each equation, we already decided, should have the dimensionality of $W_i$ so that we can just add it to $W_i$.
  • Let us take a look at the chain rule terms:
    – The first term $\frac{\partial E}{\partial z_3}$ is the derivative of a scalar with respect to a vector. In numerator convention, this gives us a matrix of the same dimensions as $z_3^T$, that is $1 \times l_3$.
    – Look at the term $\frac{\partial z_3}{\partial a_2}$. In numerator convention, the dimensionality of this term is $l_3 \times l_2$.
    – Leave out the last term in each expression (the derivative with respect to $W_i$), and we notice that, assuming numerator convention, the chain rule works out perfectly – that is, the dimensions of each term make the expression compatible for multiplication. This is no surprise, since that is the whole point of the convention.
  • An issue you have probably noticed is the bad-looking last term, which is a derivative of a vector with respect to a matrix and is not well defined in terms of vectors or matrices.

As a summary, for each expression, we have (i) the first few terms multiplying out to give a nice derivative, (ii) a last term that we don’t know how to express, and (iii) the need to put everything together in denominator convention so that we can add it to $W_i$.

Let us deal with these issues one by one.

Really digging into the specifics

Just so that we can have at least one constant amongst all these variable indices in our discussion, we will only run over the specifics for calculating $\frac{\partial E}{\partial W_0}$. The derivatives with respect to the other weight matrices follow exactly the same procedure.

(i) The nice derivative from the first few terms:

Consider multiplying all terms but the last one for $\frac{\partial E}{\partial W_0}$. This is the expression for $\frac{\partial E}{\partial z_1}$. And since we followed numerator convention while calculating this, we know the final multiplied value has dimensions $1 \times l_1$. It is the following row matrix:
$$\frac{\partial E}{\partial z_1} = \begin{bmatrix} \frac{\partial E}{\partial z_1^0} \;\;\; \frac{\partial E}{\partial z_1^1} \;\;\; \frac{\partial E}{\partial z_1^2} \;\;\; \dots \;\;\; \frac{\partial E}{\partial z_1^{l_1-1}} \end{bmatrix}$$

(ii) That pesky last term: $\frac{\partial z_1}{\partial W_0}$

The derivative of a vector with $l_1$ terms with respect to a matrix with $l_1 \times l_0$ terms will have $l_0 \times l_1 \times l_1$ terms. There is no well-defined matrix to capture this derivative. What we will do here is not calculate this last term at all, and instead try to see if there are any observations for this particular problem that help us get around calculating it.

(iii) Putting everything together

Forget the last term for now. Let’s look at the LHS $\frac{\partial E}{\partial W_0}$:

$$E = f(z_1, \dots)$$
$$z_1 = W_0 \times a_0$$

Although $E$ depends on other terms, those terms don’t depend on $W_0$ and are hence irrelevant for the partial derivative below. This is what each element of our partial should be:
$${\frac{\partial E}{\partial W_0}}^{jk} = \sum_i \frac{\partial E}{\partial z_1^i} \frac{\partial z_1^i}{\partial W_0^{jk}}$$
There are two factors in each term of the summation on the RHS. The first partial we already have, from the row vector of the previous section. The second partial we don’t know. But what is the second partial? Since $z_1 = W_0 \times a_0$, we know that each element $z_1^i = \sum_r W_0^{ir} a_0^r$. This implies:
$$\frac{\partial z_1^i}{\partial W_0^{jk}} = a_0^k \delta_{ij}$$
And since the summation for ${\frac{\partial E}{\partial W_0}}^{jk}$ runs through every $i$, and we only get a non-zero value when $i=j$, we have:
$${\frac{\partial E}{\partial W_0}}^{jk} = \frac{\partial E}{\partial z_1^j} a_0^k$$

Phew! That was simplified quite a bit!
Now, how do we get the above by multiplying matrices we already have? This is not too hard:
$$\frac{\partial E}{\partial W_0} = {\left(\frac{\partial E}{\partial z_1}\right)}^T a_0^T$$
That’s it!
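If you want to convince yourself numerically that this outer product really has the elements derived above, a finite-difference check is a quick sanity test. The snippet below is a hedged sketch, not the repository code: it treats $E$ as the sum of squared components of $y - z_n$, uses a small sigmoid network with hypothetical layer sizes, and perturbs a single element of $W_0$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(weights, x, y):
    """E = (y - z_n)^2, summed over components; the output layer is z_n itself."""
    a = x
    for k, W in enumerate(weights):
        z = W @ a
        a = sigmoid(z) if k < len(weights) - 1 else z
    return float(np.sum((y - a) ** 2))

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                  # l_0, l_1, l_2 (so n = 2 here)
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(2)]
x = rng.normal(size=(3, 1))
y = rng.normal(size=(2, 1))

# Analytic dE/dW_0 via the identity above.
z1 = weights[0] @ x
z2 = weights[1] @ sigmoid(z1)
s1 = sigmoid(z1).flatten()                         # sigma(z_1) as a flat array
dE_dz2 = (-2 * (y - z2)).T                         # 1 x l_2, numerator convention
dE_dz1 = dE_dz2 @ weights[1] @ np.diag(s1 * (1 - s1))   # 1 x l_1
dE_dW0 = dE_dz1.T @ x.T                            # (dE/dz_1)^T a_0^T, shaped like W_0

# Finite-difference check of one element, say (j, k) = (1, 2).
eps = 1e-6
perturbed = [W.copy() for W in weights]
perturbed[0][1, 2] += eps
numeric = (error(perturbed, x, y) - error(weights, x, y)) / eps
print(dE_dW0[1, 2], numeric)                       # should agree to several decimal places
```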

(iv) In summary (of this section)

To calculate $\frac{\partial E}{\partial W_i}$, write down the chain rule expression as shown in the Writing down the chain rule section above. Calculate each term following numerator convention. Multiply all but the last term to get $\frac{\partial E}{\partial z_{i+1}}$. Finally, multiply its transpose with $a_i^T$ to get $\frac{\partial E}{\partial W_i}$.

The in-between calculations

In the previous section, for the first few terms, we just said to calculate them using numerator convention, but we didn’t specify how. Go back up to the Writing down the chain rule section and notice that apart from the first and the last term, there are basically two kinds of derivatives: $\frac{\partial z_{i+1}}{\partial a_i}$ and $\frac{\partial a_i}{\partial z_i}$.
To calculate an $\frac{\partial a_i}{\partial z_i}$ derivative:
$${\frac{\partial a_i}{\partial z_i}}^{jk} = \frac{\partial \sigma(z_i^j)}{\partial z_i^k} = \sigma'(z_i^j)\delta_{jk}$$
Therefore,
$$\frac{\partial a_i}{\partial z_i} = \mathrm{diag}\left(\begin{bmatrix}\sigma'(z_i^0) \;\;\; \sigma'(z_i^1) \;\;\; \dots \;\;\; \sigma'(z_i^{l_i-1})\end{bmatrix}\right)$$

To calculate a $\frac{\partial z_{i+1}}{\partial a_i}$ derivative:
We have,
$$z_{i+1} = W_i \times a_i$$
Therefore,
$$z_{i+1}^{j} = \sum_r W_i^{jr} a_i^r$$
and
$${\frac{\partial z_{i+1}}{\partial a_i}}^{jk} = W_i^{jk}$$
So, we have:
$$\frac{\partial z_{i+1}}{\partial a_i} = W_i$$

That’s it! We have the $\frac{\partial z_{i+1}}{\partial a_i}$ and the $\frac{\partial a_i}{\partial z_i}$ derivatives, and now we’re good to go!
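In code, both derivatives are cheap. $\frac{\partial z_{i+1}}{\partial a_i}$ is just the weight matrix, and multiplying a row vector by $\mathrm{diag}(\sigma'(z_i))$ is the same as an elementwise multiplication, so the diagonal matrix never has to be built explicitly. A small sketch, assuming sigmoid as the non-linearity:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                 # sigma'(z) = sigma(z) * (1 - sigma(z))

def dz_da(W_i):
    # dz_{i+1}/da_i is exactly W_i.
    return W_i

def da_dz(z_i):
    # da_i/dz_i = diag(sigma'(z_i^0), ..., sigma'(z_i^{l_i - 1}))
    return np.diag(sigmoid_prime(z_i).flatten())

# Note: row_vector @ da_dz(z_i) gives the same result as
# row_vector * sigmoid_prime(z_i).T, without building the diagonal matrix.
```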

The Dynamic Programming Portion

It is worth noting that if you calculate the expressions from the $(n-1)^{th}$ index down to the $0^{th}$ index, you can reuse a portion of the previous solution to help with the next one; hence it makes sense to do the calculations backward, which is where the name backpropagation originates.
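A sketch of that backward sweep, building on the hypothetical `forward` and `sigmoid_prime` helpers from the earlier snippets: keep a running row vector `delta` equal to $\frac{\partial E}{\partial z_{i+1}}$ and peel off one layer at a time, so each shared chain-rule prefix is computed exactly once.

```python
def backprop(weights, x, y):
    """Return [dE/dW_0, ..., dE/dW_{n-1}], each shaped like the corresponding W_i."""
    zs, activations = forward(weights, x)          # from the earlier sketch
    n = len(weights)
    grads = [None] * n

    # Start of the recurrence: dE/dz_n for E = (y - z_n)^2, a 1 x l_n row vector.
    delta = (-2.0 * (y - zs[n])).T
    grads[n - 1] = delta.T @ activations[n - 1].T  # (dE/dz_n)^T a_{n-1}^T

    # Walk backward, reusing the prefix already stored in delta.
    for i in range(n - 2, -1, -1):
        # dE/dz_{i+1} = dE/dz_{i+2} * W_{i+1} * diag(sigma'(z_{i+1}))
        delta = (delta @ weights[i + 1]) * sigmoid_prime(zs[i + 1]).T
        grads[i] = delta.T @ activations[i].T      # (dE/dz_{i+1})^T a_i^T
    return grads
```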

More than one data point

We do hope you’re going to use this for a dataset of size larger than one. Of course, if that’s your way of guaranteeing zero error, more power to you. For $t$ training samples, you should apply backpropagation to each training sample and add up the $dW_i$ you get for each. Remember that each $dW_i$ is obtained by multiplying the learning rate $m$ with the negative of the gradient, which we saw how to calculate above.
$$dW_i^{total} = dW_i^0 + dW_i^1 + \dots + dW_i^{t-1}$$
We then update each weight matrix $W_i$ by adding the corresponding $dW_i^{total}$ to it, and repeat the process.
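Putting it together for $t$ samples, one full update step might look like the following sketch (again using the hypothetical `backprop` helper from above):

```python
def train_step(weights, samples, m):
    """One gradient-descent step over a list of (x, y) training pairs, in place."""
    totals = [np.zeros_like(W) for W in weights]
    for x, y in samples:
        for total, grad in zip(totals, backprop(weights, x, y)):
            total += grad                  # dW_i^total accumulates over the samples
    for i, total in enumerate(totals):
        weights[i] += -m * total           # dW_i = -m * (summed gradient)
```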

Finally, the code

Code implementing the above, with documentation and testing examples, can be found in this GitHub repository. Feel free to comment here or send PRs on the repo!
