Derivative of Cost Function with Respect to Output Layer Weight

Summary
The discussion revolves around the difficulty in correctly deriving the cost function's derivative with respect to the output layer weights in a neural network. The user consistently arrives at an incorrect answer despite following the mathematical process, leading to confusion about the treatment of the gradient as a matrix versus a vector. The importance of averaging the gradients over training examples is emphasized, as the goal is to adjust weights based on a single representative value rather than multiple values. Suggestions include reviewing vectorized implementations and ensuring that all necessary terms, such as the 1/m factor, are included. The conversation highlights the need to focus on the overarching goal of optimizing the cost function rather than getting lost in the intricacies of linear algebra.
  • #31
It might be true that squared errors are not as good. The point is that they can be used to train a neural net; it doesn't have to be a mean squared error. I had to effectively turn Nielsen's code into mine to get it to work (minus using the average of costs per input vector), because making the biases a column vector makes NumPy broadcast a row vector against a column vector at each step of the forward propagation, producing a matrix where a vector is expected. Even if an older version of NumPy allowed this kind of addition without broadcasting, it still doesn't make much sense mathematically.
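To make the broadcasting problem concrete, here's a minimal sketch; the shapes are illustrative, not taken from Nielsen's actual network sizes:

```python
import numpy as np

a = np.ones((1, 3))       # activations as a row vector (one example, 3 features)
W = np.ones((3, 4))       # weights: 3 inputs to 4 neurons
b_col = np.ones((4, 1))   # biases as a column vector
b_row = np.ones((1, 4))   # biases as a row vector

z = a @ W                    # shape (1, 4)
print((z + b_col).shape)     # (4, 4): row broadcast against column, a matrix
print((z + b_row).shape)     # (1, 4): the shape the next layer actually expects
```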

I was a little frustrated at Nielsen for this. I would personally rate his book very low because of it. It's possible that I misinterpreted something in his book, but as far as I can tell, it contains faulty information that did the opposite of aiding me in understanding neural networks. That's just my experience.

My matrix math is the same as his, except I don't break everything up into single vectors and average them. I simply let the NumPy matrix operations add everything up for me, and then I absorb the coefficient ##\frac{1}{\text{batch size}}## into the learning rate. So it's effectively doing the same thing. In my opinion, my way is less confusing. It's also faster. But, of course, these are my opinions. I would be surprised if anyone truly walked away from that book with a solid understanding of what's really going on, but I'm sure it's possible.
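As a rough sketch of the difference (all names and shapes here are illustrative, not Nielsen's): one matrix multiplication sums the per-example outer products that a per-example loop would accumulate, and the averaging coefficient can then be folded into the step size.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 32, 784, 10

X = rng.standard_normal((batch_size, n_in))       # activations into the layer
delta = rng.standard_normal((batch_size, n_out))  # per-example error terms
W = rng.standard_normal((n_in, n_out))

# One matmul sums the per-example outer products that would
# otherwise be accumulated one vector at a time in a Python loop:
grad_W = X.T @ delta

# Absorb the 1/batch_size average into the learning rate:
eta = 0.5
W -= (eta / batch_size) * grad_W  # same update, one less explicit average
```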

Thanks for your help, though! This was a super difficult problem, and decoding it has taught me a lot.

FYI: As I mentioned before, even after debugging Nielsen's code, his program gave the same result as mine. His also encounters the local minimum, because he initializes his weights as random values between 0 and 1, which is the wrong approach for this problem. He probably got lucky when he ran his code, as I did last night, and the random weights just happened to initialize in a way that circumvented the local minimum. Initializing the weights and biases as random values between -1 and 1, or -0.5 and 0.5, does much better.

Oh, I should add that you can get a decent model after only a single epoch of training, but only if that local minimum is not encountered. Initializing the weights to random values between -0.5 and 0.5 has avoided the minimum in all tests so far.
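A minimal sketch of the two initialization schemes, with purely illustrative layer sizes:

```python
import numpy as np

rng = np.random.default_rng()
n_in, n_out = 784, 30  # illustrative layer sizes

# The scheme I'm describing above: all-positive weights in [0, 1)
W_bad = rng.random((n_in, n_out))

# Zero-centered alternatives that avoided the bad minimum in my tests:
W_pm1 = rng.uniform(-1.0, 1.0, (n_in, n_out))
W_pm05 = rng.uniform(-0.5, 0.5, (n_in, n_out))
```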
 
  • #32
There's nothing wrong with my math/code. There was a local minimum. Nielsen's code encounters the same minimum; he just got lucky when he ran his code and randomly produced the right weights at initialization, avoiding it. And yes, increasing the batch size will produce an overflow error if the learning rate is not decreased at the same time. It's not a huge issue, but it's probably better to include the ##\frac{1}{\text{batch size}}## factor, now that you mention it.

I would say the main issue with Nielsen's math/code is making the biases a column vector. It doesn't make sense to do that, and that's why his feedforward function does not work. Making the biases a row vector reproduces the math that I derived, which does work. Also, you don't have to average over each input vector one at a time in a for loop, as Nielsen does. You can do it all with a single matrix multiplication and then put ##\frac{1}{\text{batch size}}## in front of it. This is much faster, less complicated, and accomplishes the same thing.
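Here's a minimal sketch of what I mean, assuming a squared-error cost and sigmoid activations; all names and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
batch_size, n_in, n_hidden, n_out = 16, 4, 8, 3

X = rng.uniform(-0.5, 0.5, (batch_size, n_in))
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
b1 = np.zeros((1, n_hidden))   # row vector: broadcasts cleanly over the batch
W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
b2 = np.zeros((1, n_out))

# Feedforward for the whole batch at once, no per-example loop:
A1 = sigmoid(X @ W1 + b1)      # (batch_size, n_hidden)
A2 = sigmoid(A1 @ W2 + b2)     # (batch_size, n_out)

# Output-layer gradient for squared error, averaged over the batch
# with one matmul and a 1/batch_size factor in front:
Y = np.zeros_like(A2)          # placeholder targets
delta2 = (A2 - Y) * A2 * (1 - A2)
grad_W2 = (1.0 / batch_size) * (A1.T @ delta2)
grad_b2 = delta2.mean(axis=0, keepdims=True)
```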

But don't worry, I'm putting together my own tutorial. It will explain everything about the basic feedforward neural net, and it will only require 10 minutes of your time. I will give a special shoutout to Physics Forums as well, since you have contributed quite a bit to this. It was neat to see the different ways a neural net could be interpreted, the problems that were encountered, and how the solution was finally obtained. It was a local minimum! That's pretty amazing; I don't think anyone suspected that was the problem. I am a bit disappointed with Nielsen, though. But I'm often disappointed with tutorials in general, which is why I feel it may be profitable for me to make my own.
 
