Derivative of Cost Function with Respect to Output Layer Weight

Discussion Overview

The discussion revolves around deriving the gradient of the cost function with respect to the weights of the output layer in a neural network. Participants explore the mathematical formulation and the application of the chain rule in this context, focusing on the relationships between the input, output, and cost matrices.

Discussion Character

  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant describes the structure of the input matrix ##X## and the output matrix ##\hat{Y}##, emphasizing the dimensions and roles of each in the neural network.
  • Another participant defines the cost function ##C(Y;\hat{Y})## and expresses it in terms of the output matrix elements, indicating the dependence on the activations of the output layer perceptrons.
  • There is a discussion on the weighted input ##z_{m,k}## and its relationship to the activations, with a focus on how to express the derivative of the cost function with respect to the weights ##\omega^{(\ell)}_{j,k}##.
  • Participants explore the application of the chain rule to derive the gradient of the cost function, breaking it down into components related to the activations and the weighted inputs.
  • There is a proposal to express the gradient of the cost function in matrix form, linking the input to the output layer and the gradients of the cost function with respect to the weighted inputs; a sketch of this decomposition, under assumed index conventions, appears after this list.
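As a sketch of the decomposition discussed above (with assumed conventions: ##m## indexes the input vector within the batch, ##j## a perceptron in the previous layer, ##k## a perceptron in the output layer ##\ell##, and ##\sigma## the activation function):

$$\frac{\partial C}{\partial \omega^{(\ell)}_{j,k}}
= \sum_{m} \frac{\partial C}{\partial a^{(\ell)}_{m,k}}\,
\frac{\partial a^{(\ell)}_{m,k}}{\partial z_{m,k}}\,
\frac{\partial z_{m,k}}{\partial \omega^{(\ell)}_{j,k}}
= \sum_{m} \frac{\partial C}{\partial a^{(\ell)}_{m,k}}\,
\sigma'\!\left(z_{m,k}\right) a^{(\ell-1)}_{m,j},$$

which in matrix form reads ##\nabla_{W^{(\ell)}} C = \big(A^{(\ell-1)}\big)^{\top} \Delta^{(\ell)}##, with ##\Delta^{(\ell)}_{m,k} = \partial C / \partial z_{m,k}##.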

Areas of Agreement / Disagreement

Participants generally agree on the structure of the matrices involved and the application of the chain rule, but there is no consensus on the correctness of the derived expressions or the final form of the gradient of the cost function.

Contextual Notes

The discussion includes complex mathematical expressions and assumptions about the activation functions and their derivatives, which may not be fully resolved or agreed upon by all participants.

  • #31
It might be true that squared errors are not as good. The point is that they can be used to train a neural net; it doesn't have to be a mean squared error. I had to effectively turn Nielsen's code into mine to get it to work (minus using the average of costs per input vector), because making the biases a column vector causes NumPy to broadcast a row vector against a column vector at each recursive step of the forward-propagation function. Even if an older version of NumPy allowed this kind of addition without broadcasting, it still doesn't make much sense mathematically.
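To illustrate that broadcasting problem (the shapes here are hypothetical, not taken from either code base): adding a column-vector bias to row-vector activations makes NumPy form an outer sum rather than the intended element-wise addition.

```python
import numpy as np

a = np.ones((1, 3))       # activations stored as a row vector, shape (1, 3)
b_col = np.zeros((3, 1))  # biases as a column vector, shape (3, 1)
b_row = np.zeros((1, 3))  # biases as a row vector, shape (1, 3)

print((a + b_col).shape)  # (3, 3): outer broadcast, mathematically meaningless here
print((a + b_row).shape)  # (1, 3): element-wise addition, as intended
```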

I was a little frustrated with Nielsen over this, and I would personally rate his book very low because of it. It's possible that I misinterpreted something in his book, but as far as I can tell it contains faulty information that did the opposite of helping me understand neural networks. That's just my experience.

My matrix math is the same as his, except I don't break everything up into single vectors and average them. I simply let the NumPy matrix operations add everything up for me, and then absorb the coefficient ##\frac{1}{\text{batch size}}## into the learning rate. So it's effectively doing the same thing. In my opinion, my way is less confusing, and it's also faster. But, of course, these are my opinions. I would be surprised if anyone truly walked away from that book with a solid understanding of what's really going on, but I'm sure it's possible.
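A minimal sketch of what that batched computation might look like (the names A_prev and delta are mine, not from either code base): a single matrix product sums the per-sample contributions, and the ##\frac{1}{\text{batch size}}## factor can either be applied explicitly, as here, or folded into the learning rate.

```python
import numpy as np

def weight_gradient(A_prev, delta, batch_size):
    """A_prev: (batch, n_in) activations feeding the layer.
    delta:  (batch, n_out) gradients of the cost w.r.t. the weighted inputs.
    Returns the (n_in, n_out) weight gradient, averaged over the batch."""
    # One matrix product replaces the per-sample loop and the explicit averaging;
    # dividing by batch_size here is equivalent to absorbing it into the learning rate.
    return (A_prev.T @ delta) / batch_size
```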

Thanks for your help, though! This was a super difficult problem, and decoding it has taught me a lot.

FYI: As I mentioned before, even after debugging Nielsen's code, his program gave the same result as mine. His also encounters the local minimum, because he initializes his weights as random values between 0 and 1, which is the wrong approach for this problem. He probably got lucky when he ran his code, as I did last night, and the random weights just happened to initialize in a way that circumvented the local minimum. Initializing the weights and biases as random values between -1 and 1, or -0.5 and 0.5, does much better.

Oh, I should add that you can get a decent model after only a single epoch of training, but only if that local minimum is not encountered. Initializing the weights to random values between -0.5 and 0.5 has avoided the minimum in all tests so far.
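A small sketch of the two initialization schemes being compared (the layer sizes are placeholders):

```python
import numpy as np

n_in, n_out = 784, 10  # placeholder layer sizes
rng = np.random.default_rng()

w_unit = rng.random((n_in, n_out))            # values in [0, 1): ran into the local minimum
w_centered = rng.random((n_in, n_out)) - 0.5  # values in [-0.5, 0.5): avoided it in these tests
```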
 
  • #32
There's nothing wrong with my math/code. There was a local minimum, and Nielsen's code encounters the same minimum. He just got lucky when he ran his code: the weights happened to be randomly initialized in a way that avoided it. And yes, the batch size will produce an overflow error if the learning rate is not simultaneously decreased. It's not a huge issue, but it's probably better to include the ##\frac{1}{\text{batch size}}## factor now that you mention it.

I would say the main issue with Nielsen's math/code is making the biases a column vector. It doesn't make sense to do that, and that's why his feedforward function does not work. Making the biases a row vector reproduces the math that I derived, which does work. Also, you don't have to average each input vector one at a time in a for loop, as Nielsen does. You can do it all with a single matrix multiplication and then put ##\frac{1}{\text{batch size}}## in front of it. This is much faster, less complicated, and accomplishes the same thing.
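For concreteness, here is a hedged sketch of a batched feedforward pass with row-vector biases (the function and variable names are mine, not Nielsen's, and a sigmoid activation is assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(X, weights, biases):
    """X: (batch, n_in); weights[l]: (n_in_l, n_out_l); biases[l]: (1, n_out_l)."""
    A = X
    for W, b in zip(weights, biases):
        Z = A @ W + b  # (batch, n_out) + (1, n_out) broadcasts across the batch, row by row
        A = sigmoid(Z)
    return A
```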

But don't worry, I'm putting together my own tutorial. It will explain everything about the basic feedforward neural net, and it will only require 10 minutes of your time. I'll give a special shoutout to Physics Forums as well, since you have contributed quite a bit to this. It was neat to see the different ways a neural net could be interpreted, the problems that were encountered, and how the solution was finally obtained. It was a local minimum! That's pretty amazing; I don't think anyone suspected that was the problem. I am a bit disappointed with Nielsen, though. But then, I'm often disappointed with tutorials in general, which is why I feel it may be profitable for me to make my own.
 
