Why is my multilayered, feedforward neural network not working?

  • Thread starter: Zap
  • Tags: Network, Neural
AI Thread Summary
The discussion revolves around troubleshooting a multilayered feedforward neural network developed in Python that is not functioning as expected. The user reports that the network outputs nonsensical results after training, despite the program running without obvious errors. Key issues include identical outputs across multiple training runs and confusion regarding the implementation of backpropagation, particularly the handling of gradients as matrices versus vectors. Suggestions include debugging the code step-by-step, ensuring that the math behind the network is correct, and clarifying the expected versus actual outputs for better troubleshooting. The conversation emphasizes the complexity of neural network implementation and the importance of clear communication regarding specific problems encountered.
Zap
TL;DR Summary
I've no idea why it's not working ...
Hey, guys.

So, I've developed a basic multilayered, feedforward neural network from scratch in Python. However, I cannot for the life of me figure out why it is still not working. I've double checked the math like ten times, and the actual code is pretty simple. So, I have absolutely no idea where I am going wrong.

You can see my neural net on my github pages website if you follow the link below:
https://zachary-chiodini.github.io/neural-network/feedforwardneuralnetwork.html

The web page is very neat and written like a tutorial. However, it's not a very good tutorial, because the program doesn't work.

I just cannot for the life of me figure out why. If you are interested, you may peruse that web page and maybe tell me where I'm going wrong. I will appreciate it. If you are not interested, that is okay, too.

Thanks.
 
Sadly, this is not a reasonable request. No one is going to slog through Python code to see why it doesn't work, especially in the case of neural net code, where you use training data to teach it and then test data to make sure it works.

There are just so many places where it could go wrong: swapped indices, wrong indices, a math function not returning what you expect.

My suggestion is to step through the code with a good Python IDE like PyCharm and see if you can figure out why it is misbehaving. It may take a really long time, but through persistent effort you will succeed, or decide to become a manager so you can boss other people into finding the bug.

You can also start at the bottom and make sure each of your low-level methods and functions is working as expected. Maybe even create test cases that run them through several types of input to exercise each if branch and loop in the code.

You know it's not working, which means you've seen something that tells you it's not working, so you are the best one to debug it. We don't have that knowledge.
 
Zap said:
it is still not working

What does "not working" mean? By itself that statement is so vague that I don't see how you could expect anyone to have any useful input.
 
Well, there's no obvious error. The program runs, but it doesn't do what it's supposed to. The training algorithm does not seem to be working correctly. After training, the network outputs complete nonsense.

I am comparing it with the MLP class in the Sklearn library. The Sklearn MLP can handle what I'm trying to do perfectly. However, my network outputs complete nonsense after training.

It also does weird things that should not be possible, like computing identical losses after instantiating and training the class multiple times, and outputting matrices that are composed entirely of a single duplicated vector or row.

The coding is based entirely off of the math. So, I'm guessing I did the math wrong, but I can't see where.
 
PeterDonis said:
What does "not working" mean? By itself that statement is so vague that I don't see how you could expect anyone to have any useful input.
Zap said:
but it doesn't do what it's supposed to.
Zap said:
It also does weird things that should not be possible

Well, that certainly clears things up! o0)

@jedishrfu and @PeterDonis are right. "Debug my code for me" is not a reasonable request, especially if you can't give a clear, simple and reproducible way to generate the error. That should be your next step - figuring out the minimum code and the minimum data to reproduce the error. Without that, we don't have a hope of figuring it out. With that, maybe you can without any help.
 
Zap said:
it doesn't do what it's supposed to

So what is it supposed to do? And what different thing is it actually doing? We can't read your mind and we can't see your computer screen by clairvoyance.

Zap said:
I am comparing it with the MLP class in the Sklearn library.

Comparing what? Expected outputs for some set of test inputs? What are they? How are we supposed to know any of this if you don't tell us the specifics?
 
You're right. I'm a little burnt out on this thing, though. I'm not sure how to give a precise description of what is not working when I have no idea what's not working.

I was hoping to get a second opinion on the math that I'm using, but it would be tough to write it all out in its entirety. If someone is very familiar with feedforward neural nets, I was hoping they might be able to take a quick look at the math I'm using and verify whether it's correct or not.

There are only slight differences between my formulas and the formulas found on the wiki page for backpropagation under matrix multiplication. Where I am calling the gradient of the cost function ##C## with respect to the input layer ##z##, or ##\nabla_{z}C##, the wiki page is calling just delta, or ##\delta##.

The main differences that I'm seeing between my formulas and the wiki page's formulas are that the wiki page is using ##\nabla_{W}^{(l)}C=\delta^{(l)}(a^{(l-1)})^{T}## where I am using ##\nabla_{W}^{(l)}C=(a^{(l-1)})^{T}\delta^{(l)}##, and the wiki page is using ##\delta^{(l-1)}=(f^{(l-1)})^{'}\bullet(W^{(l)})^{T}\bullet\delta^{(l)}## where I am using ##\delta^{(l-1)}=(f^{(l-1)})^{'}\bullet\delta^{(l)}(W^{(l)})^{T}##.

I think the reason for the slight difference in the formulas is that the wiki page is doing a special case of row-wise backpropagation, where I am doing backpropagation on the entire data set at once.

In the wiki page, ##\delta^{(l)}## is a vector, which is most likely the error of a single example in the data set, but I am having ##\delta^{(l)}## be a matrix, which is the error term for all of the data in the set. Is it not possible to do backpropagation on all of the data at once? It's intuitive to me to do backpropagation on all of the data and make ##\delta^{(l)}## a matrix, because the loss function ##C## is a function of all of the data. In my model, the loss/cost function is ##C(\hat{Y};Y)##, where the model output ##\hat{Y}## and the target matrix ##Y## are both matrices. So, taking the gradient of the loss/cost function must also be a matrix, right? How could it be a vector? And what would be the point of separating the data into batches if the gradient is being computed for each row at a time?

It's also not a lot of code. I'm just saying if anyone is interested, they can look over it. If not, that's fine.
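
A short derivation may help relate the two sets of formulas in this post. It is only a sketch under assumptions that the thread does not state explicitly: the rows of ##A^{(l-1)}## and ##\Delta^{(l)}## are the per-example vectors ##a_i^{(l-1)}## and ##\delta_i^{(l)}## written as rows, the code computes ##Z = AW + b## (so its weight matrix is the transpose of the one on the wiki page), the cost is a sum over the ##n## examples, and ##\odot## denotes the elementwise product. Under those assumptions the matrix form is the per-example form summed over the data set:

```latex
\begin{aligned}
\nabla_{W}^{(l)} C
  &= (A^{(l-1)})^{T}\,\Delta^{(l)}
   = \sum_{i=1}^{n} a_i^{(l-1)}\,(\delta_i^{(l)})^{T}
   = \Big(\sum_{i=1}^{n} \delta_i^{(l)}\,(a_i^{(l-1)})^{T}\Big)^{T}, \\
\Delta^{(l-1)}
  &= f'(Z^{(l-1)}) \odot \Delta^{(l)}\,(W^{(l)})^{T}
  \quad\Longleftrightarrow\quad
  \delta_i^{(l-1)} = f'(z_i^{(l-1)}) \odot W^{(l)}\,\delta_i^{(l)}
  \quad \text{for each example } i .
\end{aligned}
```

Read this way, the swapped order of the factors comes from storing examples as rows rather than columns, and the matrix ##\Delta^{(l)}## simply stacks the per-example vectors.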
 
PeterDonis said:
So what is it supposed to do? And what different thing is it actually doing? We can't read your mind and we can't see your computer screen by clairvoyance.
Comparing what? Expected outputs for some set of test inputs? What are they? How are we supposed to know any of this if you don't tell us the specifics?

I'm not sure if that's entirely relevant, but I do understand your point. The error is not from setting up the problem, or applying it to a specific problem. A neural net takes an input matrix and a target matrix, runs a backpropagation gradient descent algorithm, and outputs an output matrix that should approximate the target matrix.

The error is somewhere within the class itself. I think it may have something to do with how I developed the math model for the neural network, but it's still eluding me what exactly is wrong with the math. If the math is actually correct, then it must be some weird bug, maybe from using pointers. Who knows, but I want to make sure the math is correct. Perhaps I should have rephrased the OP.
 
You might consider posting your code to e.g. http://pythonfiddle.com/, or another online Python 'fiddle' IDE, and working on it there. If you make it publicly visible (with notes on what is or isn't working), perhaps someone who has worked through the same tutorial might even run across your code ...
 
  • #10
How might the loss function compute a total loss that is exactly the same every time you train the network? Shouldn't that be impossible, since the weights are initialized randomly? I am not specifying any seed for the random weights, and it would be unlikely that Numpy would seed the random variables exactly the same multiple times. This should not be possible, and is likely not due to an incorrect mathematical formula.

So, I'm starting to think that perhaps the math is correct, even though it is a little bit different than what is written on Wikipedia. This makes me think it's a coding error, but I only have like 80 lines of code here, and I'm not seeing it.
 
  • #11
Zap said:
In the wiki page, ##\delta^{(l)}## is a vector, which is most likely the error of a single example in the data set, but I am having ##\delta^{(l)}## be a matrix, which is the error term for all of the data in the set. Is it not possible to do backpropagation on all of the data at once?

You can do backpropagation on all the data at one time (called 'batch' gradient descent), in mini-batches (called 'mini-batch stochastic gradient descent') or one sample at a time (called 'stochastic gradient descent', terminology varies): https://ruder.io/optimizing-gradient-descent/
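
As a rough illustration of the three regimes named above, the sketch below uses a single hypothetical helper, `iterate_batches`, and only changes the batch size. The helper name, the toy data, and the shapes are assumptions made for this example and are not taken from the code under discussion.

```python
import numpy as np

def iterate_batches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs that cover the data once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        rows = order[start:start + batch_size]
        yield X[rows], Y[rows]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 toy examples, 3 features
Y = rng.normal(size=(10, 2))   # 10 toy targets, 2 outputs

# batch_size == len(X): batch gradient descent (one update per epoch)
# batch_size == 4:      mini-batch stochastic gradient descent
# batch_size == 1:      stochastic gradient descent (one update per example)
batches_per_epoch = {
    size: sum(1 for _ in iterate_batches(X, Y, size, rng))
    for size in (len(X), 4, 1)
}
print(batches_per_epoch)
```

In every case the gradient for one update can still be computed with matrices; only the number of rows in each matrix changes.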
 
  • #12
I would test your assumption about NumPy's random weights, or check how they are used in the calculation. Perhaps you're doing something to make them "predictable".
 
  • #13
I thought that doing the gradient over all of the data or batches of the data made more sense. The Wikipedia article was confusing me. I guess it tries to simplify things by talking about a single vector, but that just totally confused me.

With 100 neurons and 1 hidden layer, the output was the same for every network I trained. The weights kept changing, but the output would still be the same, for some reason. That's why the loss was constant over all random weights.

This is an output I am getting after training with 30 neurons and 1 hidden layer, which doesn't make sense. For some reason every output in the output matrix is exactly the same. It makes no sense, but I think I can say this is probably not due to a mathematical error.
Python:
>>>FFNN.forwardpropagation( X )
array([[1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       ...,
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14]])
 
  • #14
Write comments in your code: for an algorithm like this I would expect more lines of comment than there are lines of code.

Except in trivial loops where i, j and k are ok, make your indices look like indices. If you are frequently traversing over a fixed number of elements, set that number as a 'constant' at the beginning of the function: so instead of
Python:
for hidden_layer in range( len( self.network ) - 1, 0, -1 ) :
write something like
Python:
# Set some reusable loop bounds.
last_network_layer = len(self.network) - 1
...
# Traverse the hidden layers top down.
for hidden_layer_index in range( last_network_layer, 0, -1 ) :

Remember Python lists are indexed from 0 to len(myList) - 1.

Make use of infix operators where relevant, and be clear and consistent with your variable names (grad_b seems to be the derivative of the bias, but what is grad_w?): so instead of
Python:
self.network[ hidden_layer ] = self.network[ hidden_layer ] - r*grad_w
self.bias[ hidden_layer ] = self.bias[ hidden_layer ] - r*grad_b
write something like
Python:
# Reduce the hidden layer's network and bias values by their scaled derivatives.
self.network[ hidden_layer_index ] -= step_size * grad_network
self.bias[ hidden_layer_index ] -= step_size * grad_bias
 
  • #15
  • Like
Likes pbuk and atyy
  • #16
He seems to be applying the gradient descent row-wise, or treating ##\delta## as a vector instead of a matrix, which I don't understand. He used batches, yet, splits each batch into vectors, so it doesn't make any sense to me.
 
  • #17
Zap said:
Summary:: I've no idea why it's not working ...

Hey, guys.

So, I've developed a basic multilayered, feedforward neural network from scratch in Python. However, I cannot for the life of me figure out why it is still not working. I've double checked the math like ten times, and the actual code is pretty simple. So, I have absolutely no idea where I am going wrong.

You can see my neural net on my github pages website if you follow the link below:
https://zachary-chiodini.github.io/neural-network/feedforwardneuralnetwork.html

The web page is very neat and written like a tutorial. However, it's not a very good tutorial, because the program doesn't work.

I just cannot for the life of me figure out why. If you are interested, you may peruse that web page and maybe tell me where I'm going wrong. I will appreciate it. If you are not interested, that is okay, too.

Thanks.

1614155738548.png

I recommend changing your color scheme.

One thing that I noticed is that your derivative code doesn't match your math.

1614157035379.png


Zap said:
How might the loss function compute a total loss that is exactly the same every time you train the network? Shouldn't that be impossible, since the weights are initialized randomly? I am not specifying any seed for the random weights, and it would be unlikely that Numpy would seed the random variables exactly the same multiple times.

If you're not seeding the random number generator, then I would expect it to be exactly the same weights every time. Seeding the random number generator (with a different number each time you run the program, e.g. with the time) is how you would get different ones.

You might have a bunch of simple low-level mistakes here and there. I recommend tracing through your code, checking that every line is doing what you expect it to do. If your output is not even changing, then you must have something wrong. Is it entering your loops? When it is supposed to be updating the outputs, what is happening?

Also, I don't know what your inputs and target outputs are.

No need to stress over a possible math error, when you probably have some simple coding errors anyway.
 
  • #18
Jarvis323 said:
I recommend changing your color scheme.
View it with Google's Dark Reader extension lol

My whole reason for doing this is (1) to understand a basic feed forward neural network entirely and (2) to make a multi-part instructional video to post on YouTube to spruce up my online portfolio and potentially make some money. So, I will be screen recording the tutorial while using Google Dark Reader extension, hence the strange color scheme. However, I can't seem to get my code to work! I was so close to finishing this ... I even have about 120 powerpoint slides painstakingly deriving all of the math. I put so much effort into this.

Jarvis323 said:
If you're not seeding the random number generator, then I would expect it to be exactly the same weights every time. Seeding the random number generator (with a different number each time you run the program, e.g. with the time) is how you would get different ones.
I believe Numpy automatically reseeds the random number generator, probably based on the current time, or something like that.
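
The seeding question can be checked directly. The sketch below uses NumPy's `default_rng`, whose documentation states that fresh entropy is pulled from the operating system when no seed is passed. Whether the code under discussion uses `default_rng` or the older `np.random` functions is not stated in the thread, so treat the choice of API here as an assumption.

```python
import numpy as np

# No seed given: each generator is initialised from fresh OS entropy.
a = np.random.default_rng().random(5)
b = np.random.default_rng().random(5)

# Same explicit seed: the two streams are identical.
c = np.random.default_rng(42).random(5)
d = np.random.default_rng(42).random(5)

unseeded_match = np.array_equal(a, b)   # a match here would be astronomically unlikely
seeded_match = np.array_equal(c, d)
print(unseeded_match, seeded_match)
```

So identical losses across runs are more likely to come from the network producing the same output regardless of its weights than from the random numbers repeating.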
 
  • #19
Jarvis323 said:
One thing that I noticed is that your derivative code doesn't match your math.
The math is showing the gradient applied to the ##m^{th}## example. The code is applying the gradient to the entire input matrix. At least, that's what I think it's doing.

Oh, you meant for the activation function. I believe I made a typo in the math. Good catch.

Jarvis323 said:
No need to stress over a possible math error, when you probably have some simple coding errors anyway.

I'm thinking the math is most likely correct, because when I start rearranging things, the matrix multiplications fail due to the operands' dimensions not matching up.
 
  • #20
:cry:
 
  • #21
I made the recommended changes and tried to make it a little more readable, but I still cannot find out where there is an error. I'm thinking it must be a mathematical error, because the code is very simple, and I don't see any error in the code.
 
  • #22
Try flow charting it.
The attention to detail required along with the parallel processing of your visual channel is surprisingly effective.
 
  • #23
Python:
class FeedForwardNeuralNetwork( FeedForwardNeuralNetwork  ) :
  
    def __forwardpropagation( self, X : Input_Matrix, A : Dict, Z : Dict ) -> Output_Matrix :
        ...
        for layer_n in range( 1, len( self.weights ) + 1 ) :
            Z[ layer_n ] = np.matmul( A[ layer_n - 1 ],  self.weights[ layer_n ]  ) + self.biases[ layer_n ]
            ...
It looks like you have some off by 1 errors. The loop goes over layer_n = 1 to len(self.weights). On the last iteration, when layer_n=len(weights), you are accessing self.weights[ len( self.weights ) ].

This should give an error, list index out of bounds, which makes me think that either len( self.weights )=0, so the loop isn't even entered, or __forwardpropagation is never actually called. Maybe the first loop condition in the train function, np.sqrt( totgrad ) > convergence, is never true?

After fixing your indexing errors, answer some questions.

Have you tried tracing the execution? What happens? Is it even going into the loops? At which point does something unexpected happen?

If you have to, input something simple, so you know what the results should look like in each operation.

Edit:

Also, you have the same off by 1 error in train, again trying to access weights[ len(weights) ], which tells us again that the code in that inner loop is never executed (because if it was, you would get an error and the program would halt).

Or maybe your program is halting with errors and you're not noticing it?

Also, why are weights[ 0 ] and biases[ 0 ] left undefined?
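
One way to act on the "input something simple" suggestion is a finite-difference gradient check, a technique not mentioned in the thread and brought in here as an illustration. The sketch assumes a single sigmoid layer with ##Z = XW + b## and a halved sum-of-squares cost; all function names are hypothetical and none of this is the code from the linked page.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, X, Y):
    A = sigmoid(X @ W + b)
    return 0.5 * np.sum((A - Y) ** 2)

def analytic_grad_W(W, b, X, Y):
    A = sigmoid(X @ W + b)
    delta = (A - Y) * A * (1.0 - A)   # dC/dZ, one row per example
    return X.T @ delta                # matrix form of the weight gradient

def numeric_grad_W(W, b, X, Y, eps=1e-6):
    # Central differences: perturb one weight at a time and watch the cost.
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (cost(W + step, b, X, Y) - cost(W - step, b, X, Y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 toy examples, 3 features
Y = rng.random(size=(5, 2))   # toy targets in [0, 1)
W = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 2))

max_diff = np.max(np.abs(analytic_grad_W(W, b, X, Y) - numeric_grad_W(W, b, X, Y)))
print(max_diff)
```

If the two gradients agree to many decimal places on a tiny problem like this, the formula is probably not the culprit and attention can move to the rest of the code.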
 
  • #24
Yea. It might be a little confusing, but I'm not using lists to store the weights and biases. I'm using two dictionaries with keys that are integer values. The keys are the layer numbers. The input layer is the zeroth layer and does not have any weights or biases. Hence why there is no weights[0] or biases[0].

I decided to use a dictionary, because it's easier to populate and repopulate. Also, the math I derived is treating the input layer as ##l = 0##, and the output layer as ##l = number\ of\ hidden\ layers + 1##. So, I wanted to follow that convention.

Also, when storing the weighted input of each layer ##Z## and the output of each layer ##A##, it is easier to follow mathematically if they are dictionaries with layer number as keys. That's why I decided not to use ##i## in the for loops.

So, there are no lists in this program lol, except for the keyword argument "hidden_layer."
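
The difference described above can be reproduced in a few lines. The containers and their string values below are stand-ins invented for this example; only the loop bounds mirror the quoted `range( 1, len( self.weights ) + 1 )`.

```python
# Keys are layer numbers; there is deliberately no entry for layer 0.
weights_dict = {1: "W1", 2: "W2", 3: "W3"}
# The same three entries in a list sit at positions 0, 1 and 2.
weights_list = ["W1", "W2", "W3"]

# With a dict keyed 1..L, the loop bounds 1..len(weights) are all valid keys.
visited = [weights_dict[layer_n] for layer_n in range(1, len(weights_dict) + 1)]

# With a list, the final index len(weights) is one past the end.
try:
    weights_list[len(weights_list)]
    list_error = None
except IndexError as exc:
    list_error = type(exc).__name__

print(visited, list_error)
```

This is why the loop looks like an off-by-one error to a reader who assumes a list, yet runs without complaint on a dictionary.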
 
  • #25
I figured out why the program outputs an array of identical vectors, as seen in post #13.

It is because the weighted input to hidden layer 1 is a matrix of large numbers, due to the dot product of a large number of features, so that the output of layer 1 is an array of nothing but 1s, because the sigmoid activation function returns approximately 1 for any number larger than like 5. The weighted input matrix to the first hidden layer has values of around 50. Then, when the next layer receives this array of 1s, it applies its weights and activation function and outputs an array of identical vectors, and so on.
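
The saturation effect described in this post can be reproduced with made-up data. In the sketch below, the raw 0 to 255 inputs, the all-positive uniform weights, and the rescaled alternative are assumptions chosen to trigger and then avoid the effect. They are not the initialisation used on the linked page.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    # Two sigmoid layers, biases omitted to keep the sketch short.
    return sigmoid(sigmoid(X @ W1) @ W2)

rng = np.random.default_rng(0)
n_examples, n_features, n_hidden, n_out = 6, 784, 30, 10

# Unscaled inputs and order-1 positive weights: huge weighted inputs.
X_raw = rng.integers(0, 256, size=(n_examples, n_features)).astype(float)
W1_big = rng.random((n_features, n_hidden))
W2_big = rng.random((n_hidden, n_out))
out_saturated = forward(X_raw, W1_big, W2_big)

# Inputs scaled to [0, 1] and small zero-mean weights: weighted inputs of order 1.
X_scaled = X_raw / 255.0
W1_small = rng.normal(0.0, 1.0 / np.sqrt(n_features), (n_features, n_hidden))
W2_small = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), (n_hidden, n_out))
out_scaled = forward(X_scaled, W1_small, W2_small)

rows_identical_saturated = np.allclose(out_saturated, out_saturated[0])
rows_identical_scaled = np.allclose(out_scaled, out_scaled[0])
print(rows_identical_saturated, rows_identical_scaled)
```

In the saturated case every hidden activation is 1, so every example produces the same output row, which matches the array shown in post #13.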
 
  • #26
Zap said:
Yea. It might be a little confusing, but I'm not using lists to store the weights and biases. I'm using two dictionaries with keys that are integer values. The keys are the layer numbers. The input layer is the zeroth layer and does not have any weights or biases. Hence why there is no weights[0] or biases[0].

I decided to use a dictionary, because it's easier to populate and repopulate. Also, the math I derived is treating the input layer as ##l = 0##, and the output layer as ##l = number\ of\ hidden\ layers + 1##. So, I wanted to follow that convention.

Also, when storing the weighted input of each layer ##Z## and the output of each layer ##A##, it is easier to follow mathematically if they are dictionaries with layer number as keys. That's why I decided not to use ##i## in the for loops.

So, there are no lists in this program lol, except for the keyword argument "hidden_layer."
Ahh, I see. Yeah, that threw me off.
 
  • #27
Zap said:
I decided to use a dictionary, because it's easier to populate and repopulate.
I think that one of the things you should take away from this exercise is that that was not a good decision.
 
  • #28
pbuk said:
I think that one of the things you should take away from this exercise is that that was not a good decision.
Why though?

I'm almost positive it's a mathematical error, now. I suppose using dicts with integer keys may be a little confusing, but it follows the math notation, so I'll prob stick to it until the program actually starts working. Then, maybe I'll go in and change the dicts to lists. Maybe I'll include an input layer in the lists of weights and biases, so it will still follow the math notation.

For now, I have to compare the math that I derived and used with other sources. Hopefully, I find another source that is also treating ##\delta## as a matrix, instead of a vector, or at least one that explains why ##\delta## should be a vector and not a matrix.

The strange thing is that the program does minimize the cost function, or at least seems to minimize it. Somehow, it achieves a model with a smaller cost than what the Sklearn module produces. However, the output is complete nonsense. It's just a matrix of identical vectors, and every vector is composed of nothing but zeros. Somehow, this is a better model, according to the sum of squared residuals.

I am applying the model to the MNIST training set, which is composed of 60,000 training examples. My erroneous model achieves a cost of 59,999, which makes sense if it outputs a matrix of nothing but zeros. I don't think it's an issue of overfitting, because if I decrease the training time or number of iterations, the cost goes up and it still outputs a matrix of identical vectors. The Sklearn models' costs are typically around 70,000-100,000, but they are actual working models.

It's kind of funny that a nonsense model that outputs nothing but zeros is a better model according to the cost function lol. I wish this were just an issue of overtraining, but it's not. I'm convinced the math is wrong.
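
The arithmetic behind the 59,999 figure can be sketched as follows. The example assumes one-hot targets over 10 classes and an unhalved sum of squared residuals. The thread mentions the sum of squared residuals but not the target encoding, so the encoding is an assumption, and the labels are randomly generated stand-ins for MNIST.

```python
import numpy as np

n_examples, n_classes = 60_000, 10
rng = np.random.default_rng(0)
labels = rng.integers(0, n_classes, size=n_examples)   # stand-in labels

# One-hot targets: exactly one 1 per row.
Y = np.zeros((n_examples, n_classes))
Y[np.arange(n_examples), labels] = 1.0

# A "model" that outputs zeros for every example.
Y_hat = np.zeros_like(Y)
cost_all_zeros = np.sum((Y - Y_hat) ** 2)
print(cost_all_zeros)
```

Each row contributes exactly 1 to the cost, so an all-zero output costs 60,000 under these assumptions, in line with the reported value if the outputs are almost, but not exactly, zero.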
 
  • #29
I'm actually reading that using integer values as keys for a dictionary improves performance. Why is this a bad idea in this context? When using a list, the code becomes less readable, in my opinion, due to the way lists are indexed and that each element in the list must be contiguous. This puts constraints on how you can represent the network, and forces you to represent it in such a way that is not at all intuitive, mathematically.

For example, the weighted input to the zeroth layer (input layer) does not exist. So what would you put at index 0 for the list of weighted inputs to each layer? It would also be strange to include an input layer in the list of weights, since the input layer does not have any weights. So, to avoid having a different index for the weights, the weighted inputs and the outputs of each layer, and have none of them follow the mathematical notation, I decided to use a dictionary.
 
  • #30
Zap said:
performance

Why are you worried about performance before you have gotten the right answer? If a fast but wrong answer is acceptable, I can show you some very fast algorithms.
 
  • #31
I'm just asking why I should not be using a dictionary to store weights and biases for the neural network.

So, I am confident I am dealing with a mathematical issue, which stems from my definition of the gradient. I'm still not exactly sure why it's wrong. But, I think that is where the issue is coming from.
 
  • #32
In all internet sources, the gradient is being treated as a vector, where I am treating it as a matrix.

I treat it as a matrix, because the cost function is a function of a matrix. It is also a function of the activation values of the output layer for every example in the input matrix, not just for a single vector example.

However, for some reason, the internet is telling me that the gradient should be a vector, and should be computed only for a single vector example at a time.

That's why my equations look different from what is found in internet resources, and I'm assuming that that is why my neural net is not working. But why can't the gradient be computed as a matrix, considering all of the vectors in the input matrix, of which the cost function is a function? And what is the point of batches if you can only apply the gradient per input vector?

What is the reason for this restriction? Why can't the gradient be a matrix, as I've defined in the link in the OP?
 
  • #33
Zap said:
I'm just asking why I should not be using a dictionary to store weights and biases for the neural network.
  1. Because the data you want to represent has all of the characteristics of a list and none of the characteristics of a dictionary.
  2. Because at least 2 people reading this thread have assumed you were using a list; when you return to this code in 12 months time so will you.
  3. Because for..in on a dictionary is not guaranteed to traverse the keys in numerical order.
  4. ...

Zap said:
I'm actually reading that using integer values as keys for a dictionary improves performance.
Against using non-integer values maybe, because the hashing function can be cheaper, but a list doesn't need a hashing function so is always faster.

An even better solution would be a numpy.array.
  1. It is the 'standard tool' for writing code like this.
  2. It gives you concise access to code implementing mathematical vector operations e.g. overloaded multiplication operators: your code will look a lot more like the maths.
  3. Because list traversals are implemented in C they will always be quicker than userland code so you will not have to refactor for optimum performance.
  4. ...
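
The point about traversal order can be seen with a two-entry dictionary. The sketch assumes Python 3.7 or later, where dictionaries are guaranteed to preserve insertion order; the keys and values are invented for this example.

```python
layers = {}
layers[2] = "output layer"
layers[1] = "hidden layer"   # inserted second, although its key is smaller

insertion_order = list(layers)    # the order a plain for..in loop would follow
numeric_order = sorted(layers)    # the order the layer numbers suggest
print(insertion_order, numeric_order)
```

Looping over `range(1, len(layers) + 1)` or over `sorted(layers)` avoids relying on the order in which the entries happened to be inserted.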
 
  • #34
I think you should have made it clearer that you are not attempting to implement the backpropagation algorithm on the wikipedia page, you are trying to implement an algorithm you have invented yourself. This algorithm will never work as it is based on a fundamental misunderstanding:

Zap said:
In the wiki page, ##\delta^{(l)}## is a vector, which is most likely the error of a single example in the data set

It is not. It is the difference between the current values of the coefficients in layer ## l ## and their unknown optimum values.

Here is my recommendation:
  1. study the subject so that you understand how it works; I would never recommend Wikipedia for this, in this case neuralnetworksanddeeplearning.com is a great free resource (which looks to have been quoted wholesale in Wikipedia, but why rely on the accuracy of that quote when you can go to the original?)
  2. study how scikit-learn does it https://github.com/scikit-learn/sci...earn/neural_network/_multilayer_perceptron.py
  3. implement the standard algorithm (or some other proven algorithm) using Python structures designed for the purpose (with the layers represented by a list of numpy.arrays)
Once you have done this you can start designing and implementing your own 'improved' supervised learning algorithm.
 
  • #35
pbuk said:
  1. Because the data you want to represent has all of the characteristics of a list and none of the characteristics of a dictionary.
  2. Because at least 2 people reading this thread have assumed you were using a list; when you return to this code in 12 months time so will you.
  3. Because for..in on a dictionary is not guaranteed to traverse the keys in numerical order.
  4. ...
Against using non-integer values maybe, because the hashing function can be cheaper, but a list doesn't need a hashing function so is always faster.

An even better solution would be a numpy.array.
  1. It is the 'standard tool' for writing code like this.
  2. It gives you concise access to code implementing mathematical vector operations e.g. overloaded multiplication operators: your code will look a lot more like the maths.
  3. Because list traversals are implemented in C they will always be quicker than userland code so you will not have to refactor for optimum performance.
  4. ...
Understood. When using Numpy, I'd have to create an array of lists, where each list contains an array of weights, so that each element in the outermost array has the same dimensions.
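
A plain Python list can hold NumPy arrays of different shapes, so the per-layer matrices do not need matching dimensions. The layer sizes and variable names below are hypothetical and chosen only to illustrate the layout.

```python
import numpy as np

layer_sizes = [784, 30, 10]   # input, hidden, output (hypothetical sizes)
rng = np.random.default_rng(0)

# One weight matrix per pair of adjacent layers, each with its own shape.
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
# One bias row per non-input layer.
biases = [np.zeros((1, n_out)) for n_out in layer_sizes[1:]]

weight_shapes = [W.shape for W in weights]
print(weight_shapes)
```

Here `weights[0]` connects the input layer to the first hidden layer, so list position `k` corresponds to layer number `k + 1` in the notation used earlier in the thread.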
 
  • #36
pbuk said:
I think you should have made it clearer that you are not attempting to implement the backpropagation algorithm on the wikipedia page, you are trying to implement an algorithm you have invented yourself. This algorithm will never work as it is based on a fundamental misunderstanding:
It is not. It is the difference between the current values of the coefficients in layer ## l ## and their unknown optimum values.

Here is my recommendation:
  1. study the subject so that you understand how it works; I would never recommend Wikipedia for this, in this case neuralnetworksanddeeplearning.com is a great free resource (which looks to have been quoted wholesale in Wikipedia, but why rely on the accuracy of that quote when you can go to the original?)
  2. study how scikit-learn does it https://github.com/scikit-learn/sci...earn/neural_network/_multilayer_perceptron.py
  3. implement the standard algorithm (or some other proven algorithm) using Python structures designed for the purpose (with the layers represented by a list of numpy.arrays)
Once you have done this you can start designing and implementing your own 'improved' supervised learning algorithm.
I'm trying to implement the backpropagation algorithm, I've just derived that algorithm myself instead of just copying some stuff from a website and plugging and chugging it.

When I take the derivative of the cost function with respect to one of the weights in the output layer, I get a gradient ##\nabla_{a}C## that is a matrix after applying the general chain rule on the cost function. For some reason, the algorithm found in http://neuralnetworksanddeeplearning.com and in wikipedia is treating the same gradient ##\nabla_{a}C## as a vector. This doesn't make any sense to me, unless they are applying the cost function to one example at a time. But why are they doing that? The cost function is a function of all examples in the output matrix.

It will take some extra effort on my part, but I can show you how I derive ##\nabla_{a}C## as a matrix, simply by applying the chain rule when trying to take the derivative of the cost ##C## with respect to one of the weights in the output layer. ##\nabla_{a}C## is a vector only if you are treating the cost function ##C## as being a function of only one output vector. I could make a separate thread about this, because I believe I have now identified this as being the issue in my program. However, I don't understand why it's an issue, since ##C## is a function of the output matrix, not only an output vector.
 
  • #37
As someone already mentioned, what you're trying to do is implement batch training. If you google that you should find the info you need pretty easily.

When you train in batches, you accumulate the gradients from each instance before back propagation.
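A rough sketch of what accumulating per-example gradients can look like, assuming a single linear layer with a squared-error loss (the layer, the data, and the names `grad_sum` and `eta` are made up for illustration and are not the code from the linked page):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))          # one linear layer, 3 inputs -> 2 outputs
X = rng.standard_normal((5, 3))          # mini-batch of 5 examples, one per row
Y = rng.standard_normal((5, 2))          # matching targets
eta = 0.1                                # learning rate (assumed value)

grad_sum = np.zeros_like(W)
for x, y in zip(X, Y):
    delta = W @ x - y                    # error vector for this one example
    grad_sum += np.outer(delta, x)       # gradient of 0.5 * ||W x - y||^2 w.r.t. W

W_new = W - (eta / len(X)) * grad_sum    # single update from the averaged gradient
```

The weights are changed once per batch, and the accumulated gradient always has the same shape as `W`, whatever the batch size.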
 
  • Like
Likes pbuk
  • #38
So, if ##\nabla_{a}C## can indeed be a matrix, then my math is probably not wrong. Also, the deeplearning website claims to be doing batch training, but it still treats ##\nabla_{a}C## as a vector.
 
  • #39
Zap said:
This doesn't make any sense to me, unless they are applying the cost function to one example at a time.
That is exactly what 'they' are doing. Quoting from Michael Nielsen's online book near here:
As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, ## C=C_x##. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

Zap said:
But why are they doing that? The cost function is a function of all examples in the output matrix.
Because that is how this algorithm works: it calculates the gradient of the cost function for a single example. We average over all the ## m ## training examples ## x ## (in the mini-batch) in the gradient descent step:

$$ \begin{align} w^l & \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T \\
b^l & \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l} \end{align} $$

Quoting again from Nielsen:
Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error ##\delta^l## and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.
Again my recommendation to you: study Nielsen's maths, algorithm and his reference Python implementation. Once you have a thorough understanding you will be in a position to judge whether calculating ## \nabla_{x,\delta}C ## instead of ## \frac\eta m \Sigma_{x} \nabla_{\delta} C_x ## is a good idea.
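The sum over ##x## in the update rule above can also be written as one matrix product, which may help reconcile the "matrix" and "vector" views. A small numerical check, assuming the per-example errors ##\delta^{x,l}## and activations ##a^{x,l-1}## are stored as columns (the array names, sizes and random values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6                                    # mini-batch size
delta = rng.standard_normal((4, m))      # column x holds delta^{x,l}
a_prev = rng.standard_normal((3, m))     # column x holds a^{x,l-1}

# Sum of per-example outer products, exactly as written in the update rule.
loop_sum = sum(np.outer(delta[:, x], a_prev[:, x]) for x in range(m))

# The same quantity as a single matrix product over the whole batch.
batched = delta @ a_prev.T

same = np.allclose(loop_sum, batched)
print(same, batched.shape)  # True (4, 3)
```

Either way, the result has the shape of the weight matrix ##w^l##: the example index is summed away before the weights are updated.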
 
  • Like
Likes Vanadium 50
  • #40
pbuk said:
Because that is how this algorithm works

Well, the algorithm claims to be calculating the gradient below, where ##\omega^{(ℓ)}_{j,k}## is the weight of the ##k^{th}## perceptron in the output layer ##ℓ## that is associated with the output of the ##j^{th}## perceptron in the hidden layer ##ℓ-1##.

$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\frac{\partial C }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}
\end{bmatrix}
$$

If you simply solve each of those derivatives in the matrix ##\nabla^{(ℓ)}_{\Omega}C##, you will get a matrix for ##\nabla_{a}C##, not a vector or an average of vectors. I've derived how that is done in a separate thread here.

It's not clear to me why this is the wrong approach, or why ##\nabla_{a}C## must be a vector in order for backpropagation to work. Because if we are only interested in the partial derivatives and applying the chain rule to calculate these derivatives, a matrix ##\nabla_{a}C## inevitably results.

So, what is going on here? Why can't we simply calculate the derivatives and apply the chain rule? Why must it be done one example at a time so that ##\nabla_{a}C## can be a vector? I know that's how the algorithm works, but why doesn't it work with a matrix ##\nabla_{a}C##? Why do those explaining the algorithm claim to be interested in calculating the derivatives of and applying the chain rule to the cost function ##C(Y;\hat{Y})##, but end up using some alternative vector implementation of the cost function ##C(y;\hat{y})##, instead?

It seems to me that something is being left out here. We aren't only interested in calculating the derivatives of the cost function with respect to the weights. If that were the case, ##\nabla_{a}C## would be a matrix. Something else is going on here.

Sorry if I'm repeating myself. I'm just trying to get my point across, and it's not exactly the simplest thing to convey in words.

I'm not really understanding Nielsen's approach. I can only understand my approach, which is just to calculate all the elements in ##\nabla^{(ℓ)}_{\Omega}C##, where ##C## is the actual cost function ##C(Y;\hat{Y})##.

Nielsen says
... what backpropagation actually lets us do is compute the partial derivatives ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for a single training example. We then recover ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}## by averaging over training examples.
but why not just compute ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}##, instead of "recovering" them by averaging ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for multiple examples? Nielsen does not seem to explain this.
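For reference, a sketch of the step in question, assuming (as in Nielsen's setup) that the total cost is the average of the per-example costs ##C_x## over ##m## training examples:

$$
C = \frac{1}{m}\sum_x C_x
\quad\Longrightarrow\quad
\frac{\partial C}{\partial \omega} = \frac{1}{m}\sum_x \frac{\partial C_x}{\partial \omega},
\qquad
\frac{\partial C}{\partial b} = \frac{1}{m}\sum_x \frac{\partial C_x}{\partial b}
$$

Under that assumption, averaging the per-example derivatives and differentiating the total cost directly give the same quantity, by linearity of the derivative.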
 
  • #41
Zap said:
Well, the algorithm claims to be calculating the gradient below, where ##\omega^{(ℓ)}_{j,k}## is the weight of the ##k^{th}## perceptron in the output layer ##ℓ## that is associated with the output of the ##j^{th}## perceptron in the hidden layer ##ℓ-1##.

$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\frac{\partial C }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}
\end{bmatrix}
$$

If you simply solve each of those derivatives in the matrix ##\nabla^{(ℓ)}_{\Omega}C##, you will get a matrix for ##\nabla_{a}C##, not a vector or an average of vectors. I've derived how that is done in a separate thread here.

It's not clear to me why this is the wrong approach, or why ##\nabla_{a}C## must be a vector in order for backpropagation to work. Because if we are only interested in the partial derivatives and applying the chain rule to calculate these derivatives, a matrix ##\nabla_{a}C## inevitably results.

So, what is going on here? Why can't we simply calculate the derivatives and apply the chain rule? Why must it be done one example at a time so that ##\nabla_{a}C## can be a vector? I know that's how the algorithm works, but why doesn't it work with a matrix ##\nabla_{a}C##? Why do those explaining the algorithm claim to be interested in calculating the derivatives of and applying the chain rule to the cost function ##C(Y;\hat{Y})##, but end up using some alternative vector implementation of the cost function ##C(y;\hat{y})##, instead?

It seems to me that something is being left out here. We aren't only interested in calculating the derivatives of the cost function with respect to the weights. If that were the case, ##\nabla_{a}C## would be a matrix. Something else is going on here.

Sorry if I'm repeating myself. I'm just trying to get my point across, and it's not exactly the simplest thing to convey in words.

I'm not really understanding Nielsen's approach. I can only understand my approach, which is just to calculate all the elements in ##\nabla^{(ℓ)}_{\Omega}C##, where ##C## is the actual cost function ##C(Y;\hat{Y})##.

Nielsen says

but why not just compute ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}##, instead of "recovering" them by averaging ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for multiple examples? Nielsen does not seem to explain this.
I think you might be getting confused by focusing on the linear algebra before understanding the goal, the topology, and the function that you're trying to learn.

The point of the neural network is not to learn a function that maps a set of elements to a set of targets. The point is to learn a function that maps an instance to its output, and that generalizes so that the same function works for new, unseen instances. In other words, it's modeling ##f(\mathbf{x})##, not ##f(\mathbf{X})##.

Think about the topology of the network and the function it computes. It doesn't have multiple examples and multiple targets. It's a function that maps a single instance to a single output. You don't have separate edges for different instance outputs. So how could you back propagate the errors of all instances at once (through the same edges) without accumulating them?

The implementation might support applying that function to a bunch of examples in one call, but it's the same function being applied to each example independently.
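A minimal sketch of that last point, assuming a tiny sigmoid network with made-up sizes and random parameters (the `forward` helper is hypothetical, not the code from the linked page): one call on a whole batch gives the same result as applying the network to each example separately.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Apply the network to one column vector or to a matrix of columns."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)           # b broadcasts across the example columns
    return a

rng = np.random.default_rng(2)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal((4, 1)), rng.standard_normal((2, 1))]

X = rng.standard_normal((3, 5))          # 5 examples stored as columns
batch_out = forward(X, weights, biases)  # one call for the whole batch
single_out = np.hstack([forward(X[:, [i]], weights, biases) for i in range(5)])
same = np.allclose(batch_out, single_out)
```

The columns never interact: the batch dimension is a convenience for the implementation, not part of the function being learned.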
 
  • Like
Likes Zap and pbuk
  • #42
Jarvis323 said:
I think you might be getting confused by focusing on the linear algebra before understanding the goal, the topology, and the function that you're trying to learn.

The point of the neural network is not to learn a function that maps a set of elements to a set of targets. The point is to learn a function that maps an instance to its output, that generalizes so that the same function works for new unseen instances.

Think about the topology of the network and the function it computes. It doesn't have multiple examples and multiple targets. It's a function that maps a single instance to a single output. You don't have separate edges for different instance outputs. So how could you back propagate the errors of all instances at once (through the same edges) without accumulating them?

The implementation might support applying that function to a bunch of examples in one call, but it's the same function being applied to each example independently.

Hmm ... I think we are really pinpointing why my neural network does not work right now. It was neither a coding nor a mathematical error. It was more like a misconception of what a neural network is ... That's actually pretty amazing. I'm not entirely understanding what you're saying here. I think I'll need some time to think about it further. Thank you for working this out with me.
 
  • #43
I think what you've posted needs to be explicitly stated, because it's not obvious why ##\nabla_{a}C## needs to be a vector if we are treating the algorithm as a math problem. We've worked out here that it's not exactly a straightforward math problem. There's something conceptual here that is not found by simply solving for the gradient of the cost function with respect to the weights of the neural network, and that is not obviously implied by simply stating that ##\nabla_{a}C## must be a vector.
 
  • #44
Maybe check out this video? See if it helps conceptually?

 
  • Like
Likes Tom.G and pbuk
  • #45
It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking. For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals. It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.
 
  • #46
Zap said:
It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking. For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals. It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.
I guess it's just a naming convention.

Loss function is usually a function defined on a data point, prediction and label, and measures the penalty.
...
Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization).
...
Long story short, I would say that:

A loss function is a part of a cost function which is a type of an objective function.

https://stats.stackexchange.com/que...unction-loss-function-are-they-the-same-thing
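The distinction in the quoted answer can be sketched as follows, assuming a squared-error loss and an L2 penalty (both choices, and the names `squared_loss`, `cost` and `lam`, are illustrative assumptions rather than anything fixed by the thread):

```python
import numpy as np

def squared_loss(y_hat, y):
    """Loss: penalty for a single prediction/label pair."""
    return 0.5 * np.sum((y_hat - y) ** 2)

def cost(Y_hat, Y, weights, lam=0.0):
    """Cost: mean of per-example losses plus an optional L2 penalty."""
    data_term = np.mean([squared_loss(p, t) for p, t in zip(Y_hat, Y)])
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + penalty

Y_hat = np.array([[1.0, 0.0], [0.0, 1.0]])   # two predictions, one per row
Y = np.array([[0.0, 0.0], [0.0, 1.0]])       # matching labels
weights = [np.ones((2, 2))]
c_plain = cost(Y_hat, Y, weights)            # 0.25: mean of losses 0.5 and 0.0
c_reg = cost(Y_hat, Y, weights, lam=0.1)     # 0.65: adds 0.1 * 4 as penalty
```

Whether the data term is a mean or a plain sum is a convention; the loss is defined per example in both cases.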
 
  • #47
Zap said:
I think what you've posted needs to be explicitly stated, because it's not obvious why ##\nabla_{a}C## needs to be a vector.
What @Jarvis323 posted is obvious to me because what we are trying to do is change each of the coefficients so that the network performs better. For each coefficient we therefore need a single number to adjust it by, not a column of a matrix.

Zap said:
It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking.
According to whom?

Zap said:
For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals.
What do you think the word 'mean' signifies in 'mean squared error'?

Zap said:
It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.
Look, you've misunderstood something. There's nothing wrong with that, this is a complicated topic and to avoid what Nielsen calls 'index hell' he takes a few short cuts with the notation and terminology. What is wrong is complaining that everyone else is wrong for not explaining it so you understood it first time round. That is not a mature attitude to learning.
 
  • Like
Likes Vanadium 50
  • #48
pbuk said:
What do you think the word 'mean' signifies in 'mean squared error'?

I was talking about the residual sum of squares, not the mean squared error.
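For a linear model, the two conventions differ only by a constant factor in the gradient. A quick check on made-up data (the sizes and random values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
X = rng.standard_normal((m, 3))          # design matrix, one example per row
y = rng.standard_normal(m)
w = rng.standard_normal(3)

residual = X @ w - y
grad_rss = 2.0 * X.T @ residual          # gradient of sum((Xw - y)^2)
grad_mse = grad_rss / m                  # gradient of mean((Xw - y)^2)

# Same direction; only the length (and so the effective step size) differs.
cosine = grad_rss @ grad_mse / (np.linalg.norm(grad_rss) * np.linalg.norm(grad_mse))
```

In both cases the gradient is a sum over examples of per-example terms and has the shape of `w`, not one column per example.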

pbuk said:
Look, you've misunderstood something. There's nothing wrong with that, this is a complicated topic and to avoid what Nielsen calls 'index hell' he takes a few short cuts with the notation and terminology. What is wrong is complaining that everyone else is wrong for not explaining it so you understood it first time round. That is not a mature attitude to learning.

I'm not sure where you're coming from with this. I don't think I've complained or done anything wrong, here. It was never explained why the gradient had to be a vector until post #41. Not that I require anyone to respond with an explanation, but I will continue asking questions when responded to as much as I like. If I can't ask questions, then there's no point in being on this forum, and the mods should promptly ban me.

A loss or cost function in regression analysis is not the same as the loss/cost function that is described here. They are not the same, because the approach that I've taken here will work for something like a linear regression. I think that someone could easily make that mistake. That's my opinion. It's not me having an attitude. I think it's disrespectful of you and it's irrelevant to the thread to post something like that. It contributes absolutely nothing and attempts to attack my character.
 
  • #49
Zap said:
I think it's disrespectful of you and it's irrelevant to the thread to post something like that. It contributes absolutely nothing and attempts to attack my character.
I don't think it should be read this way. It is generally difficult to communicate non-visually. People tend to see only their own perspective and disregard the fact that such a perspective might not necessarily be shared by others. If you feel misunderstood, then try to explain yourself as patiently as others try to help you. We are at post #48, and maybe people are running out of ideas about where the problem actually lies.

@all: Please try to remain on topic.
 
  • #50
The problem was resolved at post #41. I do appreciate everyone who helped out and was not condescending.
 