Derivative of Cost Function with Respect to Output Layer Weight

In summary, the conversation discusses why a derivation of the gradient in a neural network model keeps producing an apparently wrong answer. It defines the input and output matrices and a cost function, applies the general chain rule, and concludes by discussing the resulting gradient and its relationship to the weights in the output layer of the network.
  • #1
Zap
TL;DR Summary
Deriving the gradient of the cost function ##\nabla_{a}C##.
This is an issue I've been stuck on for about two weeks. No matter how many times I take this derivative, I keep getting the same answer. However, this answer is inevitably wrong. Please help me to understand why it is incorrect.

To start, I will define an input matrix ##X##, where ##n## is the number of features or independent variables in the neural network model, and ##m## is the number of examples.

$$
X = \begin{bmatrix}
x_{1,1} & x_{1,2} & \dots & x_{1,n} \\
x_{2,1} & x_{2,2} & \dots & x_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m,1} & x_{m,2} & \dots & x_{m,n}
\end{bmatrix}
$$

Next, I will define an output matrix, which is the output of a feedforward neural network after forward propagation of the input matrix ##X##. In other words, this is the model's output ##\hat{Y}##, where ##m## is the number of examples in the input matrix, and ##k## is the number of perceptrons in the output layer.

$$
\hat{Y} = \begin{bmatrix}
\hat{y}_{1,1} & \hat{y}_{1,2} & \dots & \hat{y}_{1,k} \\
\hat{y}_{2,1} & \hat{y}_{2,2} & \dots & \hat{y}_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{y}_{m,1} & \hat{y}_{m,2} & \dots & \hat{y}_{m,k} \\
\end{bmatrix}
$$

The output matrix can also be written as a matrix of the activations, or outputs, of each perceptron in the output layer, one for each element ##\hat{y}_{m,k}## of the output vector and for every example ##m##, as seen below, where ##a^{(ℓ)}_{m,k}## is the activation value of the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example.

$$
\hat{Y} = \begin{bmatrix}
a^{(ℓ)}_{1,1} & a^{(ℓ)}_{1,2} & \dots & a^{(ℓ)}_{1,k} \\
a^{(ℓ)}_{2,1} & a^{(ℓ)}_{2,2} & \dots & a^{(ℓ)}_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
a^{(ℓ)}_{m,1} & a^{(ℓ)}_{m,2} & \dots & a^{(ℓ)}_{m,k}
\end{bmatrix}
$$

There is also a target matrix ##Y## of the same form as ##\hat{Y}##, where ##m## is the number of examples, and ##k## is the number of dependent variables in the neural network model.

$$
Y = \begin{bmatrix}
y_{1,1} & y_{1,2} & \dots & y_{1,k} \\
y_{2,1} & y_{2,2} & \dots & y_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
y_{m,1} & y_{m,2} & \dots & y_{m,k} \\
\end{bmatrix}
$$

I will now define a cost function ##C(Y;\hat{Y})##, which uses the elements in the target matrix ##Y## as constant parameters and is a function of the output matrix ##\hat{Y}##.

$$C(Y;\hat{Y})$$

Let's now put our focus on the cost function's dependent variable, the output matrix ##\hat{Y}##, and make it more explicit that the cost function is a function of all the elements in the output matrix ##\hat{Y}##, as shown below.

$$C(\hat{y}_{1,1},...,\hat{y}_{1,k},...,\hat{y}_{m,1},...,\hat{y}_{m,k})$$
$$C(a^{(ℓ)}_{1,1},...,a^{(ℓ)}_{1,k},...,a^{(ℓ)}_{m,1},...,a^{(ℓ)}_{m,k})$$

Let's now define ##a^{(ℓ)}_{m,k}##. This is nothing but the value of the activation function ##f## given the weighted input ##z^{(ℓ)}_{m,k}## at the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example.

$$a^{(ℓ)}_{m,k}=f(z^{(ℓ)}_{m,k})$$

Let's now define ##z^{(ℓ)}_{m,k}##, the weighted input, where ##a^{(ℓ-1)}_{m,i}## is the activation value of the ##i^{th}## perceptron in the hidden layer ##ℓ-1## for the ##m^{th}## example, ##j## is the number of perceptrons in the hidden layer ##ℓ-1##, and ##\omega^{(ℓ)}_{i,k}## is the ##i^{th}## weight of the ##k^{th}## perceptron in the output layer ##ℓ##.

$$z^{(ℓ)}_{m,k}=\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}$$
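Computed for all ##m## examples at once, this sum is a single matrix product. Here is a minimal NumPy sketch with made-up shapes, assuming the ##i=0## term is a bias unit supplied by a column of ones:

Code:
import numpy as np

# Illustrative shapes only: m examples, j hidden perceptrons, k output perceptrons.
m, j, k = 4, 3, 2
A_prev = np.hstack([np.ones((m, 1)), np.random.rand(m, j)])  # a^(l-1)_{m,i}, i = 0..j
W = np.random.rand(j + 1, k)                                 # omega^(l)_{i,k}

Z = A_prev @ W   # Z[h, k] = sum_i A_prev[h, i] * W[i, k], shape (m, k)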

So, we want to find the value of the derivative of the cost function with respect to a weight ##\omega^{(ℓ)}_{j,k}##, which is the ##j^{th}## weight of the ##k^{th}## perceptron in the output layer ##ℓ##, denoted below.

$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=?$$

So, the first thing we can do is treat every activation whose second subscript is not ##k## as a constant, since ##\omega^{(ℓ)}_{j,k}## only affects the ##k^{th}## perceptron in the output layer ##ℓ##. That leaves the cost as a function of the ##k^{th}## activation for each of the ##m## examples.

$$C(a^{(ℓ)}_{1,k},...,a^{(ℓ)}_{m,k})$$

Next, we can note that ##a^{(ℓ)}_{m,k}=f(z^{(ℓ)}_{m,k})## and ##z^{(ℓ)}_{m,k}=\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}##, so that

$$a^{(ℓ)}_{m,k}=f\left(\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}\right)$$

and we can write ##a^{(ℓ)}_{m,k}## as a function of ##\omega^{(ℓ)}_{j,k}##, i.e. ##a^{(ℓ)}_{m,k}(\omega^{(ℓ)}_{j,k})##. Therefore

$$C(a^{(ℓ)}_{1,k}(\omega^{(ℓ)}_{j,k}),...,a^{(ℓ)}_{m,k}(\omega^{(ℓ)}_{j,k}))$$

and we can invoke the generalized chain rule.

$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial a^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}} $$

Now, applying the chain rule once more, we can separate this value into two parts.

$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial z^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} $$
$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = \frac{\partial a^{(ℓ)}_{h,k}}{\partial z^{(ℓ)}_{h,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$

which we can simplify to

$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} $$
$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = f^{'}(z^{(ℓ)}_{h,k}) \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$
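For a single output perceptron ##k##, results ##(1)## and ##(2)## look like the following NumPy sketch; the shapes are made up and the sigmoid is only a stand-in for ##f##:

Code:
import numpy as np

m, j = 4, 3
A_prev = np.random.rand(m, j + 1)   # a^(l-1)_{h,i} for all m examples (bias unit included)
Z_k = np.random.rand(m, 1)          # z^(l)_{h,k} for the k-th perceptron
dC_da_k = np.random.rand(m, 1)      # dC/da^(l)_{h,k}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
f_prime = sigmoid(Z_k) * (1.0 - sigmoid(Z_k))   # f'(z^(l)_{h,k})

dC_dz_k = f_prime * dC_da_k                     # result (2), one value per example h
dC_dw_jk = np.sum(A_prev[:, -1:] * dC_dz_k)     # result (1) for the last weight index j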

Now, let us apply this result to all of the weights in the output layer ##ℓ## for all ##k## perceptrons, defined as ##\Omega^{(ℓ)}## below, and try to find the gradient ##\nabla^{(ℓ)}_{\Omega}C##.

$$
\Omega^{(ℓ)}=
\begin{bmatrix}
\omega_{1,1} & \dots & \omega_{1,k} \\
\vdots & \ddots & \vdots \\
\omega_{j,1} & \dots & \omega_{j,k}
\end{bmatrix}
$$
$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\frac{\partial }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial }{\partial \omega^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial }{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial }{\partial \omega^{(ℓ)}_{j,k}}
\end{bmatrix}
$$

First, we simply plug result ##(1)## into this matrix.

$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} \\
\vdots & \ddots & \vdots \\
\sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}}
\end{bmatrix}
$$

This is nothing but the matrix multiplication operation

$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
a^{(ℓ-1)}_{1,1} & \dots & a^{(ℓ-1)}_{m,1} \\
\vdots & \ddots & \vdots \\
a^{(ℓ-1)}_{1,j} & \dots & a^{(ℓ-1)}_{m,j} \\
\end{bmatrix}
\begin{bmatrix}
\frac{\partial C}{\partial z^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial z^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{m,k}} \\
\end{bmatrix}
$$

This is written more compactly below, where ##L^{(ℓ-1)}## is the matrix of inputs to the output layer ##ℓ## and ##\nabla^{(ℓ)}_{z}C## is the gradient of the cost function with respect to the weighted inputs to the output layer.

$$
\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C
$$
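In NumPy this is a single matrix product; a minimal sketch with illustrative shapes, where grad_z stands in for ##\nabla^{(ℓ)}_{z}C##:

Code:
import numpy as np

m, j, k = 4, 3, 2
A_prev = np.random.rand(m, j)    # L^(l-1), one row per example
grad_z = np.random.rand(m, k)    # nabla_z C, one row per example

grad_w = A_prev.T @ grad_z       # (L^(l-1))^T nabla_z C, shape (j, k)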

Now, let's just focus on ##\nabla^{(ℓ)}_{z}C##. We can solve ##\nabla^{(ℓ)}_{z}C## by plugging in our result ##(2)##.

$$
\nabla^{(ℓ)}_{z}C= \begin{bmatrix}
f^{'}(z^{(ℓ)}_{1,1}) \frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & f^{'}(z^{(ℓ)}_{1,k}) \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
f^{'}(z^{(ℓ)}_{m,1}) \frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & f^{'}(z^{(ℓ)}_{m,k}) \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}
\end{bmatrix}
$$

We can split this into two matrices using the Hadamard product.

$$
\nabla^{(ℓ)}_{z}C= \begin{bmatrix}
f^{'}(z^{(ℓ)}_{1,1}) & \dots & f^{'}(z^{(ℓ)}_{1,k}) \\
\vdots & \ddots & \vdots \\
f^{'}(z^{(ℓ)}_{m,1}) & \dots & f^{'}(z^{(ℓ)}_{m,k})
\end{bmatrix}
⊙
\begin{bmatrix}
\frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}
\end{bmatrix}
$$
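In NumPy the Hadamard split is just an elementwise product; a sketch with illustrative shapes and a hypothetical sigmoid activation:

Code:
import numpy as np

m, k = 4, 2
Z = np.random.rand(m, k)        # weighted inputs z^(l)
grad_a = np.random.rand(m, k)   # nabla_a C

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
f_prime = sigmoid(Z) * (1.0 - sigmoid(Z))
grad_z = f_prime * grad_a       # Hadamard (elementwise) product, shape (m, k)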

And, lo and behold, the gradient ##\nabla^{(ℓ)}_{a}C## is a matrix, as shown.

$$
\nabla^{(ℓ)}_{a}C=
\begin{bmatrix}
\frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}
\end{bmatrix}
$$

So, why is this incorrect? In the backpropagation algorithm, ##\nabla^{(ℓ)}_{a}C## is always either treated as a vector or some kind of average of vectors. Why is this? Why isn't ##\nabla^{(ℓ)}_{a}C## a matrix? Where did I go wrong in the math??
 
  • #2
An [itex]n \times m[/itex] matrix is equivalent to a vector of dimension [itex]nm[/itex]. Is that what's going on?
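For example, in NumPy:

Code:
import numpy as np

M = np.arange(6).reshape(2, 3)   # a 2 x 3 matrix
v = M.ravel()                    # the same numbers viewed as a length-6 vector
M_again = v.reshape(2, 3)        # and back again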
 
  • #3
Zap said:
So, why is this incorrect? In the backpropagation algorithm, ##\nabla^{(ℓ)}_{a}C## is always either treated as a vector or some kind of average of vectors. Why is this? Why isn't ##\nabla^{(ℓ)}_{a}C## a matrix? Where did I go wrong in the math??
It might be just because you're using a vectorized implementation, and it's not working because you've neglected a few details (for example, the ##1/m## term)?

This video explains the vectorized implementation, and it looks similar to yours.

 
  • #4
I don't think anyone is in doubt that you can calculate such a matrix: the problem is how you use the matrix to alter the coefficients. What use is having ##m## different values to adjust each coefficient by, one per training example ##x##? What we want is one value for each coefficient that makes it a better fit on average over all the training examples.

I think @Jarvis323 nailed it in your other thread on the same subject (why start two threads?) - you have become over-focused on doing some linear algebra and lost sight of the goal. The goal is descending the cost function surface in the direction of steepest descent: we need to find the vector that points in that direction.
 
  • #5
Yes. I created this thread before Jarvis323's response, which I also think nailed it. This is really an interesting problem, though, I think. I'm also curious if this approach can be salvaged, since it really simplifies the algorithm.
 
  • #6
Zap said:
I'm also curious if this approach can be salvaged, since it really simplifies the algorithm.
I think that what you have is very close to the vectorized implementation from the video above.
 
  • #7
Actually, even if I treat ##\nabla^{(ℓ)}_{a}C## as a vector instead of a matrix, I still derive almost exactly the same equations, which somehow differ from what is presented on Wikipedia and in Nielsen's book.

I get these equations, treating ##\nabla^{(ℓ)}_{a}C## as a vector.

$$
\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}⊙\nabla^{(ℓ)}_{z}C
$$
$$
\nabla^{(ℓ)}_{\beta}C=\nabla^{(ℓ)}_{z}C
$$
$$
\nabla^{(ℓ)}_{z}C=F'^{(ℓ)}⊙\nabla^{(ℓ)}_{a}C
$$
$$
\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}
$$
$$
F'^{(ℓ)}=f^{'}⊙(L^{(ℓ-1)}\Omega^{(ℓ)}+\beta^{(ℓ)})
$$

And these equations, treating ##\nabla^{(ℓ)}_{a}C## as a matrix.

$$
\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C
$$
$$
\nabla^{(ℓ)}_{\beta}C=( \nabla^{(ℓ)}_{z}C ).sum( axis = 0 )
$$
$$
\nabla^{(ℓ)}_{z}C=F'^{(ℓ)}⊙\nabla^{(ℓ)}_{a}C
$$
$$
\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}
$$
$$
F'^{(ℓ)}=f^{'}⊙(L^{(ℓ-1)}\Omega^{(ℓ)}+\beta^{(ℓ)})
$$
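Read as a recipe for one layer, the matrix-form set above corresponds to something like the following NumPy sketch; the shapes, the sigmoid activation, and the variable names are all illustrative assumptions:

Code:
import numpy as np

m, j, k = 4, 3, 2
A_prev = np.random.rand(m, j)      # L^(l-1)
W = np.random.rand(j, k)           # Omega^(l)
b = np.random.rand(1, k)           # beta^(l), a row vector
grad_a = np.random.rand(m, k)      # nabla_a C at layer l

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
Z = A_prev @ W + b                             # L^(l-1) Omega^(l) + beta^(l)
F_prime = sigmoid(Z) * (1.0 - sigmoid(Z))      # F'^(l)

grad_z = F_prime * grad_a                      # nabla_z C
grad_w = A_prev.T @ grad_z                     # nabla_Omega C
grad_b = grad_z.sum(axis=0)                    # nabla_beta C
back = grad_z @ W.T                            # the (nabla_z C)(Omega^(l))^T factor passed back to layer l-1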

So, they are almost identical, but somehow completely wrong. I can derive them step by step like I did in the OP. So, I'm not sure what is going on here. Some kind of voodoo magic from Nielsen? If we look at this equation

$$
\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}
$$

Instead of this, Nielsen has

$$
\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C
$$

It's a very slight difference, but it's a pretty major difference. The operation ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## does not seem to be possible. ##\nabla^{(ℓ)}_{z}C## should be your typical gradient row vector, if treating ##\nabla^{(ℓ)}_{a}C## as a vector, with dimensions ##1 \times k##, where ##k## is the number of perceptrons in layer ##ℓ##. ##\Omega^{(ℓ)}## is a matrix of weights in the ##ℓ^{th}## layer, with dimensions ##j \times k##, where ##j## is the number of perceptrons in layer ##ℓ-1## and ##k## is the number of perceptrons in layer ##ℓ##. So, the dimensions would not allow ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## to be carried out.

My equations vary only slightly from Nielsen's, and it may be possible that both are correct, depending on whether you define a gradient as a row or column vector.

I think they are both correct. Nielsen is probably using a slightly different notation. I think my neural network was failing because I was using

$$
\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C
$$

Instead of

$$
\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}⊙\nabla^{(ℓ)}_{z}C
$$

Even though Wikipedia says

$$
\nabla^{(ℓ)}_{\Omega}C=\nabla^{(ℓ)}_{z}C(L^{(ℓ-1)})^{T}
$$

but everything else is so close that it's probably just a difference in notation.

Edit: Sorry, the two sets of equations treating ##\nabla^{(ℓ)}_{a}C## as a vector and as a matrix are actually exactly the same, since ##\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}⊙\nabla^{(ℓ)}_{z}C## is exactly the same as ##\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C##, where I'm treating ⊙ as a broadcast operation.
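To see that broadcast claim concretely, here is a one-example sketch (shapes made up):

Code:
import numpy as np

a_prev_col = np.random.rand(3, 1)   # (L^(l-1))^T for one example, shape (j, 1)
grad_z_row = np.random.rand(1, 2)   # nabla_z C for one example, shape (1, k)

# Broadcasting the column against the row gives the same (j, k) array as the matrix product.
assert np.allclose(a_prev_col * grad_z_row, a_prev_col @ grad_z_row)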
 
  • #8
Zap said:
So, they are almost identical, but somehow completely wrong. I can derive them step by step like I did in the OP. So, I'm not sure what is going on here. Some kind of voodoo magic from Nielsen? If we look at this equation

$$
\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}
$$

Instead of this, Nielsen has

$$
\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C
$$

It's a very slight difference, but it's a pretty major difference. The operation ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## does not seem to be possible.
We have:
  • ##\nabla^{(ℓ)}_{a}C## has dimensions ##1 \times k##; and
  • ##\Omega^{(ℓ)}## has dimensions ##j \times k##, so ##(\Omega^{(ℓ)})^T## has dimensions ##k \times j##
So the operation ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## is ## [k \times j] \times [1 \times k] = [1 \times j] ##, which are the dimensions of ## F'^{(ℓ-1)} ## so we can take the Hadamard product.

Looks OK to me: did you miss the transposition?
 
  • #9
pbuk said:
We have:
  • ##\nabla^{(ℓ)}_{a}C## has dimensions ##1 \times k##; and
  • ##\Omega^{(ℓ)}## has dimensions ##j \times k##, so ##(\Omega^{(ℓ)})^T## has dimensions ##k \times j##
So the operation ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## is ## [k \times j] \times [1 \times k] = [1 \times j] ##, which are the dimensions of ## F'^{(ℓ-1)} ## so we can take the Hadamard product.

Looks OK to me: did you miss the transposition?

Hold on, doesn't ##\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}## give ## [1 \times k] \times [k \times j] = [1 \times j ]##?

Okay, I think they both work if ##\nabla^{(ℓ)}_{z}C## is a vector, but only ##\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}## works if ##\nabla^{(ℓ)}_{z}C## is a matrix.
 
  • #10
Yes, and Wikipedia is using matrix multiplication whereas Nielsen uses Hadamard products because they are (quadratically) more efficient computationally.
 
  • #11
Idk, the dimensions are still not working for me.

Something like ##\nabla^{(ℓ)}_{z}C(L^{(ℓ-1)})^{T}## implies ##[1 \times k] \times [ j \times 1 ]##, which doesn't work.

Idk, I copy the math from Nielsen exactly, and I'm still running into matrix/vector dimension errors in the products. I changed everything to a Hadamard product, but I'm still running into the same errors. The only formulas that don't produce matrix/vector dimension errors when taking the products are the ones I developed, but those formulas are inherently wrong. I've already tested them possibly hundreds of times, and although they carry out the products without error, they don't train a neural network.

So, I'm probably at a point in which I have to give up.

Perhaps I'll just copy Nielsen's or Wikipedia's equations verbatim and just pretend like I actually understand what's going on, and continue treating neural nets as black boxes, at least for the time being ... very sad
 
  • #12
Zap said:
Perhaps I'll just copy Nielsen's or Wikipedia's equations verbatim and just pretend like I actually understand what's going on, and continue treating neural nets as black boxes, at least for the time being ... very sad
I don't think it's that bad - you at least have deepened your conceptual understanding of the procedure even if you can't hammer the linear algebra into shape right now - Nielsen himself points to the benefits of 'detailed insights into how changing the weights and biases changes the overall behaviour of the network'.

And as you say this only needs to be for the time being - come back to it in 6 months from a different perspective and it may all fall into place.
 
  • #13
In the vectorized implementation, the main difference I see between your implementation of backpropagation and a correct vectorized implementation is this:

You compute grad_w and grad_b as:

Code:
grad_w = np.matmul( A[ layer -1 ].T, grad_z )
grad_b = np.sum( grad_z, axis=0 )

And a correct version, based on the video, is:

Code:
grad_w = np.matmul( grad_z, A[ layer -1 ].T ) / batch_size
grad_b = np.sum( grad_z, axis=1, keepdims=True ) / batch_size

I don't know if the other parts of yours are a little different or not, but I think it's the same algorithm you're implementing.
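To make the comparison concrete, here is a minimal NumPy sketch (with made-up shapes) showing that the two versions compute the same weight gradient, just in two different data-layout conventions: one with a row per example and one with a column per example.

Code:
import numpy as np

m, j, k = 8, 3, 2                                              # batch size, inputs, outputs (illustrative)
A_rows, gz_rows = np.random.rand(m, j), np.random.rand(m, k)   # row-per-example layout
A_cols, gz_cols = A_rows.T, gz_rows.T                          # column-per-example layout

grad_w_rows = np.matmul(A_rows.T, gz_rows) / m                 # shape (j, k)
grad_w_cols = np.matmul(gz_cols, A_cols.T) / m                 # shape (k, j)

assert np.allclose(grad_w_rows, grad_w_cols.T)                 # same gradient, transposed layout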

(Screenshot from the video, with the vectorized implementation on the right and the regular implementation on the left; note you're using batches, so m should be the batch size.)

(Screenshot of your code.)
By the way, I've found Andrew Ng's videos to be quite good and easy to understand.
 
  • #14
When I try grad_w = np.matmul( A[ layer -1 ].T, grad_z ), I get an error saying the dimensions are not compatible. A[ layer -1 ] should be a row vector, and grad_z should also be a row vector, but the numbers of elements differ, because A[ layer -1 ] is from layer - 1 and grad_z is from the layer in front of it. You can broadcast them together, but I run into problems with the dimensions later on after doing that.
 
  • #15
Zap said:
When I try grad_w = np.matmul( A[ layer -1 ].T, grad_z ), I get an error saying the dimensions are not compatible. A[ layer -1 ] should be a row vector, and grad_z should also be a row vector, but the numbers of elements differ, because A[ layer -1 ] is from layer - 1 and grad_z is from the layer in front of it. You can broadcast them together, but I run into problems with the dimensions later on after doing that.
You might have rows and columns swapped compared to Andrew Ng's version?
 
  • #16
No ... if that were the case, then my math would be correct, and I'm assuming the training algorithm would then work.
 
  • #17
Zap said:
No ... if that were the case, then my math would be correct, and I'm assuming the training algorithm would then work.
But you also didn't normalize by the batch size, right?
 
  • #18
Zap said:
No ... if that were the case, then my math would be correct, and I'm assuming the training algorithm would then work.
Maybe, although it could be that you are being caught out by some unexpected feature of the way numpy.array overloads the * operator (e.g. you might be getting an array of Hadamard products instead of a matrix product).
 
  • #19
I'm just trying to get it to work with Nielsen's math using a batch size of 1 right now.
 
  • #20
##\nabla^{(ℓ)}_{z}C(L^{(ℓ-1)})^{T}## is not allowed in numpy dot or numpy matmul. It will result in an error. ##\nabla^{(ℓ)}_{z}C## has dimensions ##1 \times k## and ##(L^{(ℓ-1)})## has dimensions ##1 \times j##.

##(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C## is allowed. It will not result in an error.

##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## is not allowed in numpy dot or numpy matmul. It will result in an error. ##(\Omega^{(ℓ)})^{T}## has dimensions ##k \times j## and ##\nabla^{(ℓ)}_{z}C## has dimensions ## 1 \times k##.

##\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}## is allowed. It will not result in an error.

So, I'm kind of at a loss here. If I use Nielsen's math I get nothing but errors back. If I use my math, everything seems to go fine until I output a matrix containing identical vectors, usually all zeros, after training. So, what in the heck is going on? lol

If I copy and paste Nielsen's code, it also produces an error, but something different from a matrix multiplication error, if I remember correctly. I guess I will try to debug his code.
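The shape bookkeeping above can be checked directly; a sketch with arbitrary ##j=3##, ##k=2##:

Code:
import numpy as np

j, k = 3, 2
grad_z = np.random.rand(1, k)      # nabla_z C, a 1 x k row
L_prev = np.random.rand(1, j)      # L^(l-1), a 1 x j row
W = np.random.rand(j, k)           # Omega^(l)

print((L_prev.T @ grad_z).shape)   # (j, 1) @ (1, k) -> (3, 2): allowed
print((grad_z @ W.T).shape)        # (1, k) @ (k, j) -> (1, 3): allowed

for a, b in [(grad_z, L_prev.T), (W.T, grad_z)]:
    try:
        np.matmul(a, b)
    except ValueError as err:
        print("not allowed:", err)  # the two orderings that fail with these row-vector shapes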
 
  • #21
Zap said:
##\nabla^{(ℓ)}_{z}C(L^{(ℓ-1)})^{T}## is not allowed in numpy dot or numpy matmul. It will result in an error. ##\nabla^{(ℓ)}_{z}C## has dimensions ##1 \times k## and ##(L^{(ℓ-1)})## has dimensions ##1 \times j##.

##(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C## is allowed. It will not result in an error.

##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## is not allowed in numpy dot or numpy matmul. It will result in an error. ##(\Omega^{(ℓ)})^{T}## has dimensions ##k \times j## and ##\nabla^{(ℓ)}_{z}C## has dimensions ## 1 \times k##.

##\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}## is allowed. It will not result in an error.

So, I'm kind of at a loss here. If I use Nielsen's math I get nothing but errors back. If I use my math, everything seems to go fine until I output a matrix containing identical vectors, usually all zeros, after training. So, what in the heck is going on? lol

If I copy and paste Nielsen's code, it also produces an error, but something different from a matrix multiplication error, if I remember correctly. I guess I will try to debug his code.
It would be easier for me to help if you posted the code you're using.
 
  • #22
Nielsen's website: http://neuralnetworksanddeeplearning.com/chap1.html

my website: https://zachary-chiodini.github.io/neural-network/feedforwardneuralnetwork.html

I give up for today. Nielsen's code is giving me an overflow error, even if I make the learning rate like 0.0000001, and debugging someone else's code is like trying to find errors in someone else's math, no fun. You guys are putting up with both ha!

Edit: I've included the application of the network on my website.
 
  • #23
You appear to be using a learning rate of 3 in your own application, which is about 300x too big, and are you also repeating training with the same data until you get 'convergence'? How does that work?
 
  • #24
The "convergence" thing is just one of three stopping conditions. It will stop training after a max time, after a maximum number of epochs, or when the matrix norm of the gradient is less than a "convergence," since the learning should be complete when the gradient is nearly zero, or I guess when the magnitude of the gradient is nearly zero.

The learning rate of 3 should be okay. If it's too large, I set numpy to raise an overflow error. Nielsen is also using 3.
 
  • #25
Zap said:
The learning rate of 3 should be okay. If it's too large, I set numpy to raise an overflow error. Nielsen is also using 3.
So he does. I was thrown by your line
Code:
In [31] network.SGD( train, 30, 10, 0.01 )
where you use 0.01 for Nielsen's algorithm.
 
  • #26
That's because I keep getting an overflow error with Nielsen's algorithm, which is typically due to a learning rate that is too high and therefore overshoots the minimum and then explodes.
 
  • #27
I guess the icing on the cake here is that Nielsen's algorithm does not seem to work, either. I have noticed, however, that Nielsen is defining the data structure for his weights and biases differently than I am. His biases are a column vector, whereas mine are a row vector. Similarly, the dimensions for his weight matrices are the reverse of mine. This explains the differences in our formulas. So, we appear to be using the same mathematics.

I preferred to make each column in the weight matrix represent a perceptron, whereas he made each row represent a perceptron. The latter is a bit strange to me, because we might want to denote a weight ##\omega_{j,k}## as a coefficient of the ##j^{th}## perceptron's output in layer ##ℓ - 1## and belonging to the ##k^{th}## perceptron in layer ##ℓ##.

So, when making a matrix such as below, indexed as a matrix should be, it automatically makes each column a perceptron.

$$
\Omega = \begin{bmatrix}
\omega_{1,1} & \omega_{1,2} & \dots & \omega_{1,k} \\
\omega_{2,1} & \omega_{2,2} & \dots & \omega_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
\omega_{j,1} & \omega_{j,2} & \dots & \omega_{j,k}
\end{bmatrix}
$$

Also, you don't want to make the biases a column vector, because at the end of the forward propagation, your aim is to output a row vector, not a column vector. This is actually one of the reasons why Nielsen's code does not work.
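For reference, a sketch of one forward step under the row-per-example convention described here, where the row-vector biases broadcast cleanly over the batch; the shapes are illustrative and the sigmoid is just a placeholder activation:

Code:
import numpy as np

m, j, k = 4, 3, 2
A = np.random.rand(m, j)     # one row per example
W = np.random.rand(j, k)     # one column per perceptron in layer l
b = np.random.rand(1, k)     # row-vector biases, broadcast over the m rows

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
A_next = sigmoid(A @ W + b)  # shape (m, k): still one row per example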

Anyway, that's not a huge issue, and I've finally discovered that I guess the math I've been using is not any different than what Nielsen is using.

The only difference is that Nielsen is using the transpose of the weight matrix that I'm using, which explains everything.

$$
\Omega \rightarrow (\Omega)^{T}
$$

However, still, nothing appears to be working lol.

I'm surprised to find that Nielsen's code is bug-ridden. Even his forward propagation function does not work. Treating the biases as a column vector creates some odd matrix output given a single input vector, which makes no sense. I feel like debugging Nielsen's code would reproduce what I've already written and found to not work.

So, I think I'm on my own with this one. Time to start thinking outside of the box.
 
  • #28
HOLY CRAP GANG IT WORKS!

Wow! I let it train for a good 5-10 minutes, and IT WORKED!

That's crazy. My code was good the whole time, and so is Nielsen's (after debugging).

There was probably never anything wrong with my program. I just didn't let it run long enough ...

I swear I had let it run for this long before, though, and it didn't work. I have included ##\frac{1}{m}## as a coefficient in the matrix as Jarvis suggested. I think that may have been the trick, but including ##\frac{1}{m}## is kind of equivalent to changing the learning rate, if I'm not mistaken. So, it should work without it.

What was very confusing is that Nielsen claimed to get like 90% accuracy after only 30 epochs. That's not possible. You need like 500 and to wait like 5 - 10 minutes to get a decent result ...

Anyway, sorry guys.

Thanks for helping me out, though.

IT WORKS!

You know, I bet there is a saddle point or some kind of local minimum that the algorithm was getting stuck on, because I kept getting almost the identical result over and over again. It's a little odd that Nielsen does not mention this in his book. 30 epochs is enough to settle into this weird saddle point or local minimum. It's not enough to train the model.

Additionally, the cost function does not have to be a mean. I'm using the sum of squared residuals. Idk, you guys will probably call me a brat, but I think Nielsen is putting incorrect info into his book.

I will be deleting my website now. I feel as though I have probably built one of the most straightforward feedforward neural nets in Python, or at least the most straightforward that I have ever seen. So, it must be deleted.

Edit: It was actually working this entire time. You don't even have to take an average or do the gradient one vector at a time. You don't have to do any of that. Reading Nielsen's book actually set me back and made me think there was something wrong with my program. Well, if you try to run his program, you'll see there is something wrong with his. Anyway, my neural net works, and I couldn't be more relieved. You don't have to take an average of the gradient per input vector. You can treat the gradient as a matrix. Also, making the biases a column vector and thus using the math in Nielsen's book makes no sense. That's why his program does not work.

Sorry, but Nielsen is wrong, and this thread proves that he is wrong. His information is copied and distributed all over the place, even on Wikipedia, and it's all incorrect. It doesn't work.

Check my updated website while it's still available. My way is the only way that works, and I prove it there. If you try Nielsen's code or methods, it will not work. He is so wrong. This is an amazing discovery.

This code is available on https://zachary-chiodini.github.io/neural-network/feedforwardneuralnetwork.html.
 
  • #29
Just want to add one last thing. There's definitely a saddle point or a local minimum where the algorithm is getting stuck. I know this because today the program is no longer working and is giving me the same model back every run, once again. I can't see how this is possible unless it is encountering a saddle point or local minimum. I hypothesize that the initialization of the random weights may determine whether the model gets stuck at this local minimum or not.

I've initialized the weights to be a uniform random variable from -1 to 1, instead of from 0 to 1, and my program has produced a working model again. I cannot state more emphatically that Nielsen's book is totally wrong. If you follow his information, your program will not function at all. The problem this entire time was a saddle point or local minimum that was encountered because of the way the random weights were initialized. The math I developed was correct, and the algorithm I developed was correct. What I did not anticipate is that there is faulty information on the internet that seems to be almost universally accepted as the truth.
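The initialization change described here is a one-line difference in NumPy (shapes are illustrative):

Code:
import numpy as np

rng = np.random.default_rng()
W = rng.uniform(-1.0, 1.0, size=(3, 2))   # instead of rng.uniform(0.0, 1.0, size=(3, 2))
b = rng.uniform(-1.0, 1.0, size=(1, 2))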
 
  • #30
I'm glad you got your code working. I'm not sure that you are correct in saying that the code in Nielsen's book is totally wrong though - it has over 12 thousand stars and 5 thousand forks on GitHub and only 3 outstanding pull requests. The author does note that the code is written for Python 2 and he is not updating it or accepting PRs for Python 3: this might be your problem? It looks as though it is easy enough to fix for Python 3 from the PR comments, and there appear to be a couple of v3 forks around.

I also wonder if your training issues are affected by your use of the sum of squared residuals as a cost function. I have no experience in this context, but I do often find that squared errors are unsatisfactory loss functions compared to linear errors (also see for instance Berger, Statistical Decision Theory and Bayesian Analysis, s2.4.2) due to the emphasis on outliers. The only benefit I can ever see in squared errors is that the maths works better with continuous derivatives: given the choice between good maths and a good model I always go for the latter.
 
  • #31
It might be true that squared errors are not as good. The point is that they can be used to train a neural net. It doesn't have to be a mean squared error. I had to effectively turn Nielsen's code into mine to get it to work (minus using the average of costs per input vector), because making the biases a column vector makes numpy broadcast a row vector with a column vector at each recursive step in the forward propagation function. Even if the older version of numpy allowed this kind of addition without broadcasting, it still doesn't make much sense mathematically.

I was a little frustrated at Nielsen for this. I would personally rate his book very low because of this. It could be possible that I had misinterpreted something in his book, but according to me, it contains faulty information that did the opposite of aiding me in understanding neural networks. That's just my experience.

My matrix math is the same as his, except I don't break everything up into single vectors and average them. I simply allow the NumPy matrix operations to add everything up for me, and then I absorb the coefficient ##\frac{1}{\text{batch size}}## into the learning rate. So, it's effectively doing the same thing. In my opinion, my way is less confusing. It's also faster. But, of course, these are my opinions. I would be surprised if anyone truly walked away from that book with a solid understanding of what's really going on, but I'm sure it's a possibility.
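Concretely, the summing-then-scaling step looks something like this (a sketch; the names and shapes are illustrative):

Code:
import numpy as np

batch_size, j, k = 8, 3, 2
learning_rate = 3.0
A_prev = np.random.rand(batch_size, j)
grad_z = np.random.rand(batch_size, k)
W = np.random.rand(j, k)

grad_w = A_prev.T @ grad_z                   # the matrix product already sums over the batch
W -= (learning_rate / batch_size) * grad_w   # folding 1/batch_size into the step size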

Thanks for your help, though! This was a super difficult problem, and decoding it has taught me a lot.

FYI: As I mentioned before, even after debugging Nielsen's code, his program gave the same result as mine. His also encounters the local minimum, because he initializes his weights as random values between 0 and 1, which is the wrong approach for this problem. He probably got lucky when he ran his code, as I did last night, and the random weights just happened to initialize in a very lucky way to circumvent the local minimum. Initializing the weights and biases as random values between -1 and 1 or -0.5 and 0.5 will do much better.

Oh, I should add that you can get a decent model after only a single epoch of training, but only if that local minimum is not encountered. Adjusting the weights to random values between -0.5 and 0.5 seems to avoid the minimum in all tests so far.
 
  • #32
There's nothing wrong with my math/code. There was a local minimum. Nielsen's code encounters the same minimum. He just got lucky when he ran his code and randomly produced the right weights at initialization and avoided it. And yes, the batch size will produce an overflow error if the learning rate is not simultaneously decreased. It's not a huge issue, but it's probably better to include the ##\frac{1}{\text{batch size}}## now that you mention it.

I would say the main issue with Nielsen's math/code is making the biases a column vector. It doesn't make any sense to do that, and that's why his feedforward function does not work. Making the biases a row vector will reproduce the math that I derived, which does work. Also, you don't have to average each input vector one at a time in a for loop, as Nielsen does. You can do it all with a matrix multiplication, and then slap ##\frac{1}{\text{batch size}}## in front of it. This is way faster, less complicated, and accomplishes the same thing.

But don't worry, I'm putting together my own tutorial. It will explain everything about the basic feedforward neural net, and it will only require 10 minutes of your time. I will make a special shoutout to Physics Forums as well, since you guys have contributed quite a bit to this. It was neat to see the different ways a neural net could be interpreted, the problems that were encountered, and how the solution was finally obtained. It was a local minimum! That's pretty amazing. I don't think anyone was thinking the problem was that. I am a bit disappointed with Nielsen, though. But I'm often disappointed with tutorials in general, hence why I feel like it may be profitable for me to make my own.
 

What is the derivative of the cost function with respect to the output layer weight?

The derivative of the cost function with respect to the output layer weight is the rate of change of the cost function with respect to the output layer weight. It tells us how much the cost function will change for a given change in the output layer weight.

Why is the derivative of the cost function with respect to the output layer weight important?

The derivative of the cost function with respect to the output layer weight is important because it helps us optimize the output layer weight in order to minimize the cost function. This is crucial in training machine learning models, as we want to find the weights that will result in the lowest possible cost.

How is the derivative of the cost function with respect to the output layer weight calculated?

The derivative of the cost function with respect to the output layer weight is calculated using the chain rule from calculus. It involves taking the derivative of the cost function with respect to the output layer activation, and then multiplying it by the derivative of the output layer activation with respect to the output layer weight.

What is the role of the output layer weight in the cost function?

The output layer weight is a crucial parameter in the cost function, as it determines how much each output neuron contributes to the overall cost. By adjusting the output layer weight, we can control the impact of each output neuron on the final cost and therefore optimize the model's performance.

How does the derivative of the cost function with respect to the output layer weight affect the learning process?

The derivative of the cost function with respect to the output layer weight affects the learning process by guiding the model towards the optimal weights. By calculating and updating the weight using the derivative, the model can learn and improve its performance over time.
