Zap
It might be true that squared errors are not as good. The point is that they can be used to train a neural net; it doesn't have to be a mean squared error. I had to effectively turn Neilson's code into mine to get it to work (minus using the average of costs per input vector), because making the biases a column vector causes NumPy to broadcast a row vector against a column vector at each recursive step of the forward propagation function. Even if an older version of NumPy allowed this kind of addition without broadcasting, it still doesn't make much sense mathematically.
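To illustrate the broadcasting problem (the shapes below are placeholders of my own, not variables from either program): adding a row vector to a column vector doesn't add element-wise, NumPy expands the pair into a full matrix.

[CODE=python]
import numpy as np

z_row = np.array([[1.0, 2.0, 3.0]])      # activations kept as a 1 x 3 row vector
b_col = np.array([[0.1], [0.2], [0.3]])  # biases kept as a 3 x 1 column vector

print((z_row + b_col).shape)   # (3, 3) -- broadcasts to a full matrix, not 3 activations

b_flat = b_col.ravel()         # keeping the biases 1-D (or in the same orientation)
print((z_row + b_flat).shape)  # (1, 3) -- the element-wise addition you actually want
[/CODE]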
I was a little frustrated with Neilson for this. I would personally rate his book very low because of it. It's possible that I misinterpreted something in his book, but in my view it contains faulty information that did the opposite of aiding my understanding of neural networks. That's just my experience.
My matrix math is the same as his, except I don't break everything up into single vectors and average them. I simply let NumPy's matrix operations add everything up for me, and then absorb the coefficient ##\frac{1}{\text{batch size}}## into the learning rate, so it's effectively doing the same thing. In my opinion, my way is less confusing. It's also faster. But, of course, these are my opinions. I would be surprised if anyone truly walked away from that book with a solid understanding of what's really going on, but I'm sure it's possible.
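Here's a small sketch of the equivalence I mean (the variable names and shapes are made up for illustration, not code from either program): a single matrix product already sums the per-sample outer products over the batch, so dividing the learning rate by the batch size gives the same update as averaging each sample's gradient.

[CODE=python]
import numpy as np

# A: previous-layer activations for a mini-batch, shape (batch_size, n_in)
# delta: error terms for this layer, shape (batch_size, n_out)
rng = np.random.default_rng(0)
batch_size, n_in, n_out = 10, 4, 3
A = rng.standard_normal((batch_size, n_in))
delta = rng.standard_normal((batch_size, n_out))

# Per-vector version: accumulate one sample at a time, then average.
grad_loop = np.zeros((n_in, n_out))
for a, d in zip(A, delta):
    grad_loop += np.outer(a, d)
grad_loop /= batch_size

# Matrix version: one product sums over the whole batch, no explicit loop.
grad_matrix = A.T @ delta

# Absorbing 1/batch_size into the learning rate makes the two updates identical.
eta = 0.5
assert np.allclose(eta * grad_loop, (eta / batch_size) * grad_matrix)
[/CODE]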
Thanks for your help, though! This was a super difficult problem, and decoding it has taught me a lot.
FYI: as I mentioned before, even after debugging Neilson's code, his program gave the same result as mine. His also encounters the local minimum, because he initializes his weights as random values between 0 and 1, which is the wrong approach for this problem. He probably got lucky when he ran his code, as I did last night, and the random weights just happened to initialize in a very lucky way that circumvented the local minimum. Initializing the weights and biases as random values between -1 and 1, or -0.5 and 0.5, does much better.
Oh, I should add that you can get a decent model after only a single epoch of training, but only if that local minimum is not encountered. Initializing the weights to random values between -0.5 and 0.5 seems to avoid the minimum in all tests so far.
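To be concrete about the ranges (the layer sizes here are just placeholders, not the actual network):

[CODE=python]
import numpy as np

rng = np.random.default_rng()
n_in, n_out = 2, 3

w_bad  = rng.uniform(0.0, 1.0, size=(n_out, n_in))   # all-positive weights; got stuck in the local minimum here
w_ok   = rng.uniform(-1.0, 1.0, size=(n_out, n_in))  # centered on zero; much better
w_best = rng.uniform(-0.5, 0.5, size=(n_out, n_in))  # avoided the minimum in every test so far
b_best = rng.uniform(-0.5, 0.5, size=(n_out, 1))     # same treatment for the biases
[/CODE]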