Derivative of Cost Function with Respect to Output Layer Weight

Summary
The discussion revolves around the difficulty in correctly deriving the cost function's derivative with respect to the output layer weights in a neural network. The user consistently arrives at an incorrect answer despite following the mathematical process, leading to confusion about the treatment of the gradient as a matrix versus a vector. The importance of averaging the gradients over training examples is emphasized, as the goal is to adjust weights based on a single representative value rather than multiple values. Suggestions include reviewing vectorized implementations and ensuring that all necessary terms, such as the 1/m factor, are included. The conversation highlights the need to focus on the overarching goal of optimizing the cost function rather than getting lost in the intricacies of linear algebra.
  • #31
It might be true that squared errors are not as good. The point is that they can be used to train a neural net; it doesn't have to be a mean squared error. I had to effectively turn Nielsen's code into mine to get it to work (minus using the average of costs per input vector), because making the biases a column vector makes NumPy broadcast a row vector against a column vector at each step of the forward propagation, producing a matrix where a vector is expected. Even if an older version of NumPy allowed this kind of addition without broadcasting, it still doesn't make much sense mathematically.
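To make the broadcasting problem concrete, here's a minimal sketch; the shapes are illustrative, not taken from Nielsen's actual network sizes:

```python
import numpy as np

a = np.ones((1, 3))       # activations as a row vector (one example, 3 features)
W = np.ones((3, 4))       # weights: 3 inputs to 4 neurons
b_col = np.ones((4, 1))   # biases as a column vector
b_row = np.ones((1, 4))   # biases as a row vector

z = a @ W                    # shape (1, 4)
print((z + b_col).shape)     # (4, 4): row broadcast against column, a matrix
print((z + b_row).shape)     # (1, 4): the shape the next layer actually expects
```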

I was a little frustrated at Nielsen for this. I would personally rate his book very low because of it. It's possible that I misinterpreted something in his book, but as far as I can tell, it contains faulty information that did the opposite of aiding me in understanding neural networks. That's just my experience.

My matrix math is the same as his, except I don't break everything up into single vectors and average them. I simply let the NumPy matrix operations add everything up for me, and then I absorb the coefficient ##\frac{1}{\text{batch size}}## into the learning rate. So it's effectively doing the same thing. In my opinion, my way is less confusing. It's also faster. But, of course, these are my opinions. I would be surprised if anyone truly walked away from that book with a solid understanding of what's really going on, but I'm sure it's possible.
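As a rough sketch of the difference (all names and shapes here are illustrative, not Nielsen's): one matrix multiplication sums the per-example outer products that a per-example loop would accumulate, and the averaging coefficient can then be folded into the step size.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 32, 784, 10

X = rng.standard_normal((batch_size, n_in))       # activations into the layer
delta = rng.standard_normal((batch_size, n_out))  # per-example error terms
W = rng.standard_normal((n_in, n_out))

# One matmul sums the per-example outer products that would
# otherwise be accumulated one vector at a time in a Python loop:
grad_W = X.T @ delta

# Absorb the 1/batch_size average into the learning rate:
eta = 0.5
W -= (eta / batch_size) * grad_W  # same update, one less explicit average
```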

Thanks for your help, though! This was a super difficult problem, and decoding it has taught me a lot.

FYI: As I mentioned before, even after debugging Nielsen's code, his program gave the same result as mine. His also encounters the local minimum, because he initializes his weights as random values between 0 and 1, which is the wrong approach for this problem. He probably got lucky when he ran his code, as I did last night, and the random weights just happened to initialize in a way that circumvented the local minimum. Initializing the weights and biases as random values between -1 and 1, or -0.5 and 0.5, does much better.

Oh, I should add that you can get a decent model after only a single epoch of training, but only if that local minimum is not encountered. Initializing the weights to random values between -0.5 and 0.5 has avoided the minimum in all tests so far.
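A minimal sketch of the two initialization schemes, with purely illustrative layer sizes:

```python
import numpy as np

rng = np.random.default_rng()
n_in, n_out = 784, 30  # illustrative layer sizes

# The scheme I'm describing above: all-positive weights in [0, 1)
W_bad = rng.random((n_in, n_out))

# Zero-centered alternatives that avoided the bad minimum in my tests:
W_pm1 = rng.uniform(-1.0, 1.0, (n_in, n_out))
W_pm05 = rng.uniform(-0.5, 0.5, (n_in, n_out))
```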
 
  • #32
There's nothing wrong with my math/code. There was a local minimum. Nielsen's code encounters the same minimum; he just got lucky when he ran his code and randomly produced the right weights at initialization, avoiding it. And yes, increasing the batch size will produce an overflow error if the learning rate is not decreased at the same time. It's not a huge issue, but it's probably better to include the ##\frac{1}{\text{batch size}}## factor, now that you mention it.

I would say the main issue with Nielsen's math/code is making the biases a column vector. It doesn't make sense to do that, and that's why his feedforward function does not work. Making the biases a row vector reproduces the math that I derived, which does work. Also, you don't have to average over each input vector one at a time in a for loop, as Nielsen does. You can do it all with a single matrix multiplication and then put ##\frac{1}{\text{batch size}}## in front of it. This is much faster, less complicated, and accomplishes the same thing.
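Here's a minimal sketch of what I mean, assuming a squared-error cost and sigmoid activations; all names and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
batch_size, n_in, n_hidden, n_out = 16, 4, 8, 3

X = rng.uniform(-0.5, 0.5, (batch_size, n_in))
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
b1 = np.zeros((1, n_hidden))   # row vector: broadcasts cleanly over the batch
W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
b2 = np.zeros((1, n_out))

# Feedforward for the whole batch at once, no per-example loop:
A1 = sigmoid(X @ W1 + b1)      # (batch_size, n_hidden)
A2 = sigmoid(A1 @ W2 + b2)     # (batch_size, n_out)

# Output-layer gradient for squared error, averaged over the batch
# with one matmul and a 1/batch_size factor in front:
Y = np.zeros_like(A2)          # placeholder targets
delta2 = (A2 - Y) * A2 * (1 - A2)
grad_W2 = (1.0 / batch_size) * (A1.T @ delta2)
grad_b2 = delta2.mean(axis=0, keepdims=True)
```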

But don't worry, I'm putting together my own tutorial. It will explain everything about the basic feedforward neural net, and it will only require 10 minutes of your time. I will give a special shoutout to Physics Forums as well, since you have contributed quite a bit to this. It was neat to see the different ways a neural net could be interpreted, the problems that were encountered, and how the solution was finally obtained. It was a local minimum! That's pretty amazing; I don't think anyone suspected that was the problem. I am a bit disappointed with Nielsen, though. But I'm often disappointed with tutorials in general, which is why I feel it may be profitable for me to make my own.
 
