# Derivative of Cost Function with Respect to Output Layer Weight

Summary:
Deriving the gradient of the cost function ##\nabla_{a}C##.
This is an issue I've been stuck on for about two weeks. No matter how many times I take this derivative, I keep getting the same answer. However, this answer is inevitably wrong. Please help me to understand why it is incorrect.

To start, I will define an input matrix ##X##, where ##n## is the number of features or independent variables in the neural network model, and ##m## is the number of examples.

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,n} \\ x_{2,1} & x_{2,2} & \dots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \dots & x_{m,n} \end{bmatrix}$$

Next, I will define an output matrix, which is the output of a feedforward neural network after forward propagation of the input matrix ##X##. In other words, this is the model's output ##\hat{Y}##, where ##m## is the number of examples in the input matrix, and ##k## is the number of perceptrons in the output layer.

$$\hat{Y} = \begin{bmatrix} \hat{y}_{1,1} & \hat{y}_{1,2} & \dots & \hat{y}_{1,k} \\ \hat{y}_{2,1} & \hat{y}_{2,2} & \dots & \hat{y}_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{y}_{m,1} & \hat{y}_{m,2} & \dots & \hat{y}_{m,k} \\ \end{bmatrix}$$

The output matrix can also be written as a matrix of the activation or output of each perceptron in the output layer for each element ##\hat{y}_{m,k}## in the output vector for all examples ##m##, as seen below, where ##a^{(ℓ)}_{m,k}## is the activation value of the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example.

$$\hat{Y} = \begin{bmatrix} a^{(ℓ)}_{1,1} & a^{(ℓ)}_{1,2} & \dots & a^{(ℓ)}_{1,k} \\ a^{(ℓ)}_{2,1} & a^{(ℓ)}_{2,2} & \dots & a^{(ℓ)}_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ a^{(ℓ)}_{m,1} & a^{(ℓ)}_{m,2} & \dots & a^{(ℓ)}_{m,k} \end{bmatrix}$$

There is also a target matrix ##Y## of the same form as ##\hat{Y}##, where ##m## is the number of examples, and ##k## is the number of dependent variables in the neural network model.

$$Y = \begin{bmatrix} y_{1,1} & y_{1,2} & \dots & y_{1,k} \\ y_{2,1} & y_{2,2} & \dots & y_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m,1} & y_{m,2} & \dots & y_{m,k} \\ \end{bmatrix}$$

I will now define a cost function ##C(Y;\hat{Y})##, which uses the elements in the target matrix ##Y## as constant parameters and is a function of the output matrix ##\hat{Y}##.

$$C(Y;\hat{Y})$$

Let's now put our focus on the cost function's dependent variable, the output matrix ##\hat{Y}##, and make it more explicit that the cost function is a function of all the elements in the output matrix ##\hat{Y}##, as shown below.

$$C(\hat{y}_{1,1},...,\hat{y}_{1,k},...\hat{y}_{m,1},...\hat{y}_{m,k})$$
$$C(a^{(ℓ)}_{1,1} ,...,a^{(ℓ)}_{1,k},...a^{(ℓ)}_{m,1},...a^{(ℓ)}_{m,k})$$

Let's now define what is ##a^{(ℓ)}_{m,k}##. This is nothing but the value of the activation function ##f## given a weighted input ##z_{m,k}## at the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example.

$$a^{(ℓ)}_{m,k}=f(z_{m,k})$$

Let's now define ##z_{m,k}##, the weighted input, where ##a^{(ℓ-1)}_{m,i}## is the activation value of the ##i^{th}## perceptron in the hidden layer ##ℓ-1## for the ##m^{th}## example, ##j## is the number of perceptrons in the hidden layer ##ℓ-1##, and ##\omega^{(ℓ)}_{i,k}## is the ##i^{th}## weight for the ##k^{th}## perceptron in the output layer ##ℓ##.

$$z_{m,k}=\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}$$
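As a rough illustration of the two definitions above, here is a minimal NumPy sketch of the forward step for one layer. The sigmoid choice for ##f##, the sizes, and the names `A_prev`, `W`, and `sigmoid` are all assumptions made for the example, and the bias is left out for brevity.

```python
import numpy as np

def sigmoid(z):
    # Assumed activation function f; the thread does not fix a specific one.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, j, k = 4, 3, 2                  # examples, hidden units, output units
A_prev = rng.normal(size=(m, j))   # activations of layer l-1, one example per row
W = rng.normal(size=(j, k))        # weights of layer l, one column per output unit

Z = A_prev @ W                     # weighted inputs z_{h,c}, shape (m, k)
A = sigmoid(Z)                     # activations a_{h,c}, shape (m, k)

# Check one entry of Z against the explicit sum over the hidden units.
h, c = 1, 0
z_explicit = sum(A_prev[h, i] * W[i, c] for i in range(j))
print(np.isclose(Z[h, c], z_explicit))
```

Each row of `Z` and `A` belongs to one example, which matches the layout of ##\hat{Y}## used in this post.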

So, we want to find the value of the derivative of the cost function with respect to a weight ##\omega^{(ℓ)}_{j,k}##, which is the ##j^{th}## weight of the ##k^{th}## perceptron in the output layer ##ℓ##, denoted below.

$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=?$$

So, first thing we can do is treat all activations ##a^{(ℓ)}_{m,k}## without a ##k## subscript as constants, since ##\omega^{(ℓ)}_{j,k}## is only relevant to the ##k^{th}## perceptron in the output layer ##ℓ##.

$$C(a^{(ℓ)}_{m,1},...a^{(ℓ)}_{m,k})$$

Next, we can note that ##a^{(ℓ)}_{m,k}=f(z_{m,k})## and ##z_{m,k}=\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}##, so that

$$a^{(ℓ)}_{m,k}=f(\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k})$$

and we write ##a^{(ℓ)}_{m,k}## as a function of ##\omega^{(ℓ)}_{j,k}##, like ##a^{(ℓ)}_{m,k}(\omega^{(ℓ)}_{j,k})##. Therefore

$$C(a^{(ℓ)}_{m,1}(\omega^{(ℓ)}_{j,k}) , ... a^{(ℓ)}_{m,k}(\omega^{(ℓ)}_{j,k}) )$$

and we can invoke the generalized chain rule.

$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial a^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$

Now, applying the chain rule once more, we can separate this value into two parts.

$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial z^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}}$$
$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = \frac{\partial a^{(ℓ)}_{h,k}}{\partial z^{(ℓ)}_{h,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$

Which we can simplify to

$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}}$$
$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = f^{'}(z^{(ℓ)}_{h,k}) \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$
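One way to sanity-check results ##(1)## and ##(2)## is to compare them against a finite-difference estimate for a single weight. The sketch below assumes a sigmoid activation and the unnormalized quadratic cost ##C=\frac{1}{2}\sum (a-y)^2##; neither is specified above, and all names are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, A_prev, Y):
    # Assumed cost: half the sum of squared errors over all examples and outputs.
    A = sigmoid(A_prev @ W)
    return 0.5 * np.sum((A - Y) ** 2)

rng = np.random.default_rng(1)
m, j, k = 5, 3, 2
A_prev = rng.normal(size=(m, j))
W = rng.normal(size=(j, k))
Y = rng.uniform(size=(m, k))

row, col = 2, 1                      # the single weight being differentiated

A = sigmoid(A_prev @ W)
dC_dA = A - Y                        # dC/da for the assumed quadratic cost
dC_dZ = A * (1.0 - A) * dC_dA        # result (2): f'(z) * dC/da, with sigmoid f
analytic = np.sum(A_prev[:, row] * dC_dZ[:, col])   # result (1): sum over examples h

# Central finite difference on the same weight.
eps = 1e-6
W_plus = W.copy()
W_plus[row, col] += eps
W_minus = W.copy()
W_minus[row, col] -= eps
numeric = (cost(W_plus, A_prev, Y) - cost(W_minus, A_prev, Y)) / (2 * eps)

print(analytic, numeric)
```

If the two printed numbers agree to several digits, the sum over ##h## in ##(1)## is doing what the chain rule says it should for this cost.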

Now, let us apply this result to all of the weights in the output layer ##ℓ## for all ##k## perceptrons, defined as ##\Omega^{(ℓ)}## below, and try to find the gradient ##\nabla^{(ℓ)}_{\Omega}C##.

$$\Omega^{(ℓ)}= \begin{bmatrix} \omega_{1,1} & \dots & \omega_{1,k} \\ \vdots & \ddots & \vdots \\ \omega_{j,1} & \dots & \omega_{j,k} \end{bmatrix}$$
$$\nabla^{(ℓ)}_{\Omega}C= \begin{bmatrix} \frac{\partial C}{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}} \end{bmatrix}$$

First, we simply plug our result ##(1)## into this matrix.

$$\nabla^{(ℓ)}_{\Omega}C= \begin{bmatrix} \sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} \\ \vdots & \ddots & \vdots \\ \sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} \end{bmatrix}$$

This is nothing but the matrix multiplication operation

$$\nabla^{(ℓ)}_{\Omega}C= \begin{bmatrix} a^{(ℓ-1)}_{1,1} & \dots & a^{(ℓ-1)}_{m,1} \\ \vdots & \ddots & \vdots \\ a^{(ℓ-1)}_{1,j} & \dots & a^{(ℓ-1)}_{m,j} \\ \end{bmatrix} \begin{bmatrix} \frac{\partial C}{\partial z^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{1,k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial C}{\partial z^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{m,k}} \\ \end{bmatrix}$$

Which is simplified below, where ##L^{(ℓ-1)}## is the input to the output layer ##ℓ## and ##\nabla^{(ℓ)}_{z}C## is the gradient of the cost function with respect to the weighted input to the output layer.

$$\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C$$
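The step from the entry-by-entry sums to the matrix product can be checked with a short NumPy sketch. The arrays here are random stand-ins, and the names `L_prev` and `grad_z` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
m, j, k = 6, 4, 3
L_prev = rng.normal(size=(m, j))   # input to the output layer, one example per row
grad_z = rng.normal(size=(m, k))   # stand-in for dC/dz, one example per row

# Entry (r, c) is the sum over examples h of a_{h,r} * dC/dz_{h,c}.
grad_W_loop = np.zeros((j, k))
for r in range(j):
    for c in range(k):
        for h in range(m):
            grad_W_loop[r, c] += L_prev[h, r] * grad_z[h, c]

# The same quantity as a single matrix product.
grad_W = L_prev.T @ grad_z

print(np.allclose(grad_W, grad_W_loop))
```

The result has shape ##j \times k##, the same as ##\Omega^{(ℓ)}##, and the sum over the ##m## examples happens inside the product.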

Now, let's just focus on ##\nabla^{(ℓ)}_{z}C##. We can solve ##\nabla^{(ℓ)}_{z}C## by plugging in our result ##(2)##.

$$\nabla^{(ℓ)}_{z}C= \begin{bmatrix} f^{'}(z^{(ℓ)}_{1,1}) \frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & f^{'}(z^{(ℓ)}_{1,k}) \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\ \vdots & \ddots & \vdots \\ f^{'}(z^{(ℓ)}_{m,1}) \frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & f^{'}(z^{(ℓ)}_{m,k}) \frac{\partial C}{\partial a^{(ℓ)}_{m,k}} \end{bmatrix}$$

We can split this into two matrices using the Hadamard product.

$$\nabla^{(ℓ)}_{z}C= \begin{bmatrix} f^{'}(z^{(ℓ)}_{1,1}) & \dots & f^{'}(z^{(ℓ)}_{1,k}) \\ \vdots & \ddots & \vdots \\ f^{'}(z^{(ℓ)}_{m,1}) & \dots & f^{'}(z^{(ℓ)}_{m,k}) \end{bmatrix} ⊙ \begin{bmatrix} \frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}} \end{bmatrix}$$
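In NumPy the Hadamard product of two arrays of the same shape is the `*` operator. A tiny sketch with made-up numbers (the names `F_prime` and `grad_a` are only for illustration):

```python
import numpy as np

# Made-up values standing in for f'(z) and dC/da, both of shape (m, k) = (3, 2).
F_prime = np.array([[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]])
grad_a = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Elementwise (Hadamard) product: entry (h, c) is f'(z_{h,c}) * dC/da_{h,c}.
grad_z = F_prime * grad_a
print(grad_z)
```

The output keeps the ##m \times k## shape of its inputs; no sum over examples or units takes place in this step.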

And, lo and behold, the gradient ##\nabla^{(ℓ)}_{a}C## is a matrix, as shown.

$$\nabla^{(ℓ)}_{a}C= \begin{bmatrix} \frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}} \end{bmatrix}$$

So, why is this incorrect? In the backpropagation algorithm, ##\nabla^{(ℓ)}_{a}C## is always either treated as a vector or some kind of average of vectors. Why is this? Why isn't ##\nabla^{(ℓ)}_{a}C## a matrix? Where did I go wrong in the math??

## Answers and Replies

pasmith
Homework Helper
An ##n \times m## matrix is equivalent to a vector of dimension ##nm##. Is that what's going on?

So, why is this incorrect? In the backpropagation algorithm, ##\nabla^{(ℓ)}_{a}C## is always either treated as a vector or some kind of average of vectors. Why is this? Why isn't ##\nabla^{(ℓ)}_{a}C## a matrix? Where did I go wrong in the math??
It might be just because you're using a vectorized implementation, and it's not working because you've neglected a few details (for example, the ##1/m## term)?

This video explains the vectorized implementation, and it looks similar to yours.

pbuk
Gold Member
I don't think anyone is in doubt that you can calculate such a matrix: the problem is how do you use the matrix to alter the coefficients? What use is it having ## m ## different values you need to adjust each coefficient by to make it better fit training example ## x ##? What we want is one value for each coefficient to make it a better fit on average over all the training examples.

I think @Jarvis323 nailed it in your other thread on the same subject (why start two threads?) - you have become over-focused on doing some linear algebra and lost sight of the goal. The goal is descending the cost function surface in the direction of steepest descent: we need to find the vector that points in that direction.
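To make the "one value per coefficient" point concrete, here is a small sketch showing that the ##m## per-example gradients collapse into a single ##j \times k## update once they are summed or averaged. The arrays are random placeholders and the names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
m, j, k = 5, 3, 2
L_prev = rng.normal(size=(m, j))   # layer input, one example per row
grad_z = rng.normal(size=(m, k))   # stand-in for dC/dz, one example per row

# One (j, k) weight gradient per example: the outer product of that example's
# input row with its grad_z row. Result has shape (m, j, k).
per_example = np.einsum('hi,hc->hic', L_prev, grad_z)

summed = per_example.sum(axis=0)     # what L_prev.T @ grad_z computes
averaged = per_example.mean(axis=0)  # the same thing divided by m

print(np.allclose(summed, L_prev.T @ grad_z))
print(np.allclose(averaged, L_prev.T @ grad_z / m))
```

So the matrix product in the original post already combines the examples; the only difference from an average is the factor ##1/m##.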

Yes. I created this thread before Jarvis323's response, which I also think nailed it. This is a really interesting problem, though, I think. I'm also curious if this approach can be salvaged, since it really simplifies the algorithm.

I'm also curious if this approach can be salvaged, since it really simplifies the algorithm.
I think that what you have is very close to the vectorized implementation from the video above.

Actually, even if I treat ##\nabla^{(ℓ)}_{a}C## as a vector, instead of a matrix, I still derive almost exactly the same equations, which somehow differ from what is presented on Wikipedia and in Nielsen's book.

I get these equations, treating ##\nabla^{(ℓ)}_{a}C## as a vector.

$$\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}⊙\nabla^{(ℓ)}_{z}C$$
$$\nabla^{(ℓ)}_{\beta}C=\nabla^{(ℓ)}_{z}C$$
$$\nabla^{(ℓ)}_{z}C=F'^{(ℓ)}⊙\nabla^{(ℓ)}_{a}C$$
$$\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}$$
$$F'^{(ℓ)}=f^{'}⊙(L^{(ℓ-1)}\Omega^{(ℓ)}+\beta^{(ℓ)})$$

And these equations, treating ##\nabla^{(ℓ)}_{a}C## as a matrix.

$$\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C$$
$$\nabla^{(ℓ)}_{\beta}C=( \nabla^{(ℓ)}_{z}C ).sum( axis = 0 )$$
$$\nabla^{(ℓ)}_{z}C=F'^{(ℓ)}⊙\nabla^{(ℓ)}_{a}C$$
$$\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}$$
$$F'^{(ℓ)}=f^{'}⊙(L^{(ℓ-1)}\Omega^{(ℓ)}+\beta^{(ℓ)})$$
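The matrix-form equations can be tested numerically on a tiny two-layer network. This is only a sketch under assumptions that are not in the thread: a sigmoid activation, the unnormalized quadratic cost ##\frac{1}{2}\sum(a-y)^2##, and made-up names and sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    A1 = sigmoid(X @ W1 + b1)      # hidden layer, rows are examples
    A2 = sigmoid(A1 @ W2 + b2)     # output layer
    return A1, A2

def cost(X, Y, W1, b1, W2, b2):
    _, A2 = forward(X, W1, b1, W2, b2)
    return 0.5 * np.sum((A2 - Y) ** 2)   # assumed quadratic cost

def backward(X, Y, W1, b1, W2, b2):
    A1, A2 = forward(X, W1, b1, W2, b2)
    grad_z2 = A2 * (1 - A2) * (A2 - Y)           # F' (Hadamard) grad_a
    grad_W2 = A1.T @ grad_z2                     # L^T grad_z
    grad_b2 = grad_z2.sum(axis=0)                # sum over examples
    grad_z1 = A1 * (1 - A1) * (grad_z2 @ W2.T)   # F' (Hadamard) (grad_z Omega^T)
    grad_W1 = X.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)
    return grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(7)
m, n, j, k = 5, 4, 3, 2
X = rng.normal(size=(m, n))
Y = rng.uniform(size=(m, k))
W1, b1 = rng.normal(size=(n, j)), rng.normal(size=j)
W2, b2 = rng.normal(size=(j, k)), rng.normal(size=k)

grad_W1, grad_b1, grad_W2, grad_b2 = backward(X, Y, W1, b1, W2, b2)

# Central differences on one hidden-layer weight and one output bias.
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
numeric_w = (cost(X, Y, Wp, b1, W2, b2) - cost(X, Y, Wm, b1, W2, b2)) / (2 * eps)
bp, bm = b2.copy(), b2.copy()
bp[0] += eps
bm[0] -= eps
numeric_b = (cost(X, Y, W1, b1, W2, bp) - cost(X, Y, W1, b1, W2, bm)) / (2 * eps)

print(grad_W1[1, 2], numeric_w)
print(grad_b2[0], numeric_b)
```

In this sketch the analytic and numeric values agree closely, which suggests the matrix-form equations are self-consistent for this cost. It says nothing about normalization by ##m## or about the learning rate.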

So, they are almost identical, but somehow completely wrong. I can derive them step by step like I did in the OP. So, I'm not sure what is going on here. Some kind of voodoo magic from Nielsen? If we look at this equation

$$\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}$$

Instead of this, Nielsen has

$$\nabla^{(ℓ-1)}_{z}C=F'^{(ℓ-1)}⊙(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C$$

It's a very slight difference, but it's a pretty major difference. The operation ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## does not seem to be possible. ##\nabla^{(ℓ)}_{z}C## should be your typical gradient row vector, if treating ##\nabla^{(ℓ)}_{a}C## as a vector, with dimensions ##1 \times k##, where ##k## is the number of perceptrons in layer ##ℓ##. ##\Omega^{(ℓ)}## is a matrix of weights in the ##ℓ^{th}## layer, with dimensions ##j \times k##, where ##j## is the number of perceptrons in layer ##ℓ-1## and ##k## is the number of perceptrons in layer ##ℓ##. So, the dimensions would not allow ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## to be carried out.

My equations vary only slightly from Nielsen, and it may be possible that both are correct, depending whether you define a gradient as a row or column vector.
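The row-versus-column point can be checked directly. The sketch below assumes, as I understand Nielsen's convention, column-vector gradients and a weight matrix with one row per unit of the layer it feeds into, i.e. the transpose of ##\Omega^{(ℓ)}## as defined in this thread. All names and values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
j, k = 3, 2
Omega = rng.normal(size=(j, k))       # thread's layout: (units in l-1, units in l)
delta_row = rng.normal(size=(1, k))   # dC/dz for layer l as a row vector
Fp_row = rng.normal(size=(1, j))      # stand-in for f'(z) of layer l-1

# Row-vector convention (this thread): F' (Hadamard) (delta @ Omega^T), shape (1, j).
back_row = Fp_row * (delta_row @ Omega.T)

# Column-vector convention (assumed for Nielsen): weights stored as (k, j).
w_col = Omega.T                       # (k, j)
delta_col = delta_row.T               # (k, 1)
Fp_col = Fp_row.T                     # (j, 1)
back_col = Fp_col * (w_col.T @ delta_col)   # shape (j, 1)

print(np.allclose(back_row, back_col.T))
```

Under these assumptions the two formulas give the same numbers, just transposed, so the order of the factors depends on how the vectors and the weight matrix are laid out.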

I think they are both correct. Nielsen is probably using a slightly different notation. I think my neural network was failing because I was using

$$\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C$$

$$\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}⊙\nabla^{(ℓ)}_{z}C$$

Even though Wikipedia says

$$\nabla^{(ℓ)}_{\Omega}C=\nabla^{(ℓ)}_{z}C(L^{(ℓ-1)})^{T}$$

but everything else is so close that it's probably just a difference in notation.

Edit: Sorry, the two equations treating ##\nabla^{(ℓ)}_{a}C## as a vector and matrix are actually exactly the same, since ##\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}⊙\nabla^{(ℓ)}_{z}C## is exactly the same as ##\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C##, where ⊙ I'm treating as a broadcast operation.
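Here is a small sketch of why the broadcast reading of ⊙ and the matrix product coincide for a single example, and why that stops being true for a batch. The shapes and names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
j, k = 3, 2

# Batch size 1: both operands are row vectors stored as 2-D arrays.
L_prev = rng.normal(size=(1, j))
grad_z = rng.normal(size=(1, k))

via_broadcast = L_prev.T * grad_z   # (j, 1) * (1, k) broadcasts to (j, k)
via_matmul = L_prev.T @ grad_z      # (j, 1) @ (1, k) is the outer product, (j, k)
print(np.allclose(via_broadcast, via_matmul))

# Batch size m > 1: the elementwise version no longer has compatible shapes.
m = 4
L_batch = rng.normal(size=(m, j))
gz_batch = rng.normal(size=(m, k))
try:
    L_batch.T * gz_batch            # (j, m) * (m, k) cannot be broadcast
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
print(broadcast_ok)
```

For batches, only the matrix product `L_batch.T @ gz_batch` is defined, and it sums the per-example outer products.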

pbuk
Gold Member
We have:
• ##\nabla^{(ℓ)}_{a}C## has dimensions ##1 \times k##; and
• ##\Omega^{(ℓ)}## has dimensions ##j \times k##, so ##(\Omega^{(ℓ)})^T## has dimensions ##k \times j##
So the operation ##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## is ## [k \times j] \times [1 \times k] = [1 \times j] ##, which are the dimensions of ## F'^{(ℓ-1)} ## so we can take the Hadamard product.

Looks OK to me: did you miss the transposition?

Hold on, does not ##\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}## => ## [1 \times k] \times [k \times j] = [1 \times j ]##?

Okay, I think they both work if ##\nabla^{(ℓ)}_{z}C## is a vector, but only ##\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}## works if ##\nabla^{(ℓ)}_{z}C## is a matrix.

pbuk
Gold Member
Yes, and Wikipedia is using matrix multiplication whereas Nielsen uses the Hadamard products because they are (quadratically) more efficient computationally.

Idk, the dimensions are still not working for me.

Something like ##\nabla^{(ℓ)}_{z}C(L^{(ℓ-1)})^{T}## implies ##[1 \times k] \times [ j \times 1 ]##, which doesn't work.

Idk, I copy the math from Nielsen exactly, and I'm still running into matrix/vector dimension errors in the products. I changed everything to a Hadamard product, but I'm still running into the same errors. The only formulas that don't produce matrix/vector dimension errors when taking the products are the ones I developed, but those formulas are inherently wrong. I've already tested them possibly hundreds of times, and although they carry out the products without error, they don't train a neural network.

So, I'm probably at a point in which I have to give up.

Perhaps I'll just copy Nielsen's or Wikipedia's equations verbatim and just pretend like I actually understand what's going on, and continue treating neural nets as black boxes, at least for the time being ... very sad

pbuk
Gold Member
Perhaps I'll just copy Nielsen's or Wikipedia's equations verbatim and just pretend like I actually understand what's going on, and continue treating neural nets as black boxes, at least for the time being ... very sad
I don't think it's that bad - you at least have deepened your conceptual understanding of the procedure even if you can't hammer the linear algebra into shape right now - Nielsen himself points to the benefits of 'detailed insights into how changing the weights and biases changes the overall behaviour of the network'.

And as you say this only needs to be for the time being - come back to it in 6 months from a different perspective and it may all fall into place.

The main difference I see between your implementation of backpropagation and a correct vectorized implementation is this:

grad_w = np.matmul( A[ layer -1 ].T, grad_z )

And a correct version based on the video is:

grad_w = np.matmul( grad_z, A[ layer -1 ].T ) / batch_size
grad_b = np.sum( grad_z, axis=1, keepdims=True ) / batch_size

I don't know if the other parts of yours are a little different or not, but I think it's the same algorithm you're implementing.
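The two orderings of the product can be reconciled with a short sketch. It assumes the video stores one example per column (units × examples), which is my reading of that convention, while this thread stores one example per row. The names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(6)
m, j, k = 8, 3, 2   # batch size, units in layer l-1, units in layer l

# Row convention (this thread): one example per row.
A_prev_rows = rng.normal(size=(m, j))
grad_z_rows = rng.normal(size=(m, k))
grad_w_rows = A_prev_rows.T @ grad_z_rows / m                # shape (j, k)
grad_b_rows = grad_z_rows.sum(axis=0, keepdims=True) / m     # shape (1, k)

# Column convention (assumed for the video): one example per column.
A_prev_cols = A_prev_rows.T                                  # shape (j, m)
grad_z_cols = grad_z_rows.T                                  # shape (k, m)
grad_w_cols = grad_z_cols @ A_prev_cols.T / m                # shape (k, j)
grad_b_cols = grad_z_cols.sum(axis=1, keepdims=True) / m     # shape (k, 1)

print(np.allclose(grad_w_rows, grad_w_cols.T))
print(np.allclose(grad_b_rows, grad_b_cols.T))
```

Under that assumption the two versions are transposes of each other, so the substantive difference between the two snippets above is the division by the batch size.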

Screenshot from the video with the vectorized implementation on the right and the regular implementation on the left (note you're using batches so m should be the batch size):

screenshot of your code:

By the way, I've found Andrew Ng's videos to be quite good and easy to understand.

When I try grad_w = np.matmul( A[ layer -1 ].T, grad_z ) , I get an error saying the dimensions are not compatible. A[ layer -1 ] should be a row vector, and grad_z should also be a row vector, but the number of elements differ, because A[ layer -1 ] is from layer - 1 and grad_z is from the layer in front of it. You can broadcast them together, but I run into problems with the dimensions later on after doing that.

When I try grad_w = np.matmul( A[ layer -1 ].T, grad_z ) , I get an error saying the dimensions are not compatible. A[ layer -1 ] should be a row vector, and grad_z should also be a row vector, but the number of elements differ, because A[ layer -1 ] is from layer - 1 and grad_z is from the layer in front of it. You can broadcast them together, but I run into problems with the dimensions later on after doing that.
You might have rows and columns swapped compared to Andrew NG's version?

No .. if that were the case, then my math would be correct, and I'm assuming the training algorithm would then work.

No .. if that were the case, then my math would be correct, and I'm assuming the training algorithm would then work.
But you didn't normalize by the batch size also right?

pbuk
Gold Member
No .. if that were the case, then my math would be correct, and I'm assuming the training algorithm would then work.
Maybe, although it could be that you are being caught out by some unexpected feature of the way numpy.array overloads the * operator (e.g. you might be getting an array of Hadamard products instead of a matrix product).
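A tiny illustration of that pitfall with made-up square arrays, where both operators run without error but give different results:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

elementwise = A * B       # Hadamard product: [[5, 12], [21, 32]]
matrix_product = A @ B    # matrix product:   [[19, 22], [43, 50]]

print(elementwise)
print(matrix_product)
```

With square layers, using `*` where `@` was intended raises no exception, so this kind of bug only shows up as a network that fails to train.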

I'm just trying to get it to work with Nielsen's math using a batch size of 1 right now.

##\nabla^{(ℓ)}_{z}C(L^{(ℓ-1)})^{T}## is not allowed in numpy dot or numpy matmul. It will result in an error. ##\nabla^{(ℓ)}_{z}C## has dimensions ##1 \times k## and ##(L^{(ℓ-1)})## has dimensions ##j \times 1##.

##(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C## is allowed. It will not result in an error.

##(\Omega^{(ℓ)})^{T}\nabla^{(ℓ)}_{z}C## is not allowed in numpy dot or numpy matmul. It will result in an error. ##(\Omega^{(ℓ)})^{T}## has dimensions ##k \times j## and ##\nabla^{(ℓ)}_{z}C## has dimensions ## 1 \times k##.

##\nabla^{(ℓ)}_{z}C(\Omega^{(ℓ)})^{T}## is allowed. It will not result in an error.
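The four cases above can be reproduced with a short shape check. The helper `try_matmul` and the sizes are hypothetical, chosen only so that ##j \neq k##, with every vector stored as a 2-D row array.

```python
import numpy as np

def try_matmul(a, b):
    """Return the result shape, or None if np.matmul rejects the shapes."""
    try:
        return np.matmul(a, b).shape
    except ValueError:
        return None

j, k = 3, 2
L_prev = np.ones((1, j))   # row vector, layer l-1
grad_z = np.ones((1, k))   # row vector, layer l
Omega = np.ones((j, k))    # weights of layer l

results = {
    "grad_z @ L_prev.T": try_matmul(grad_z, L_prev.T),   # (1, k) @ (j, 1)
    "L_prev.T @ grad_z": try_matmul(L_prev.T, grad_z),   # (j, 1) @ (1, k)
    "Omega.T @ grad_z": try_matmul(Omega.T, grad_z),     # (k, j) @ (1, k)
    "grad_z @ Omega.T": try_matmul(grad_z, Omega.T),     # (1, k) @ (k, j)
}
for name, shape in results.items():
    print(name, "->", shape)
```

With row vectors, only the second and fourth products are defined. The other two orderings would need column vectors and a transposed weight layout.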

So, I'm kind of at a loss here. If I use Nielsen's math I get nothing but errors back. If I use my math, everything seems to go fine until I output a matrix containing identical vectors, usually all zeros, after training. So, what in the heck is going on? lol

If I copy and paste Nielsen's code, it also produces an error, but something different than a matrix multiplication error, if I remember correctly. I guess I will try to debug his code.

It would be easier for me to help if you posted the code you're using.

pbuk
Gold Member
You appear to be using a learning rate of 3 in your own application which is about 300x too big, and are you also repeating training with the same data until you get 'convergence'? How does that work?

The "convergence" thing is just one of three stopping conditions. It will stop training after a max time, after a maximum number of epochs, or when the matrix norm of the gradient is less than a "convergence," since the learning should be complete when the gradient is nearly zero, or I guess when the magnitude of the gradient is nearly zero.
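For readers following along, the three stopping conditions could be sketched roughly like this. The functions `train` and `make_step`, the default limits, and the toy cost are all hypothetical and are not taken from the code discussed in this thread.

```python
import time
import numpy as np

def train(step, max_seconds=10.0, max_epochs=100, tol=1e-6):
    """Call `step()` once per epoch until one of three conditions is met.

    `step` is a hypothetical callable that runs one epoch and returns the
    gradient as an array.
    """
    start = time.monotonic()
    for epoch in range(1, max_epochs + 1):
        grad = step()
        if np.linalg.norm(grad) < tol:              # gradient norm nearly zero
            return epoch, "converged"
        if time.monotonic() - start > max_seconds:  # time budget used up
            return epoch, "timeout"
    return max_epochs, "max_epochs"                 # epoch budget used up

def make_step(w, lr=0.5):
    # Toy problem: C(w) = 0.5 * ||w||^2, whose gradient is w itself.
    def step():
        grad = w.copy()
        w[:] = w - lr * grad    # in-place gradient descent update
        return grad
    return step

w = np.array([1.0, -2.0])
epochs, reason = train(make_step(w))
print(epochs, reason)
```

`np.linalg.norm` returns the Frobenius norm for a 2-D array by default, which is one reasonable reading of "matrix norm of the gradient".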

The learning rate of 3 should be okay. If it's too large, I set numpy to raise an overflow error. Nielsen is also using 3.

pbuk
In [31] network.SGD( train, 30, 10, 0.01 )