Why is my multilayered, feedforward neural network not working?

  • #1
Zap
TL;DR Summary
I've no idea why it's not working ...
Hey, guys.

So, I've developed a basic multilayered, feedforward neural network from scratch in Python. However, I cannot for the life of me figure out why it is still not working. I've double checked the math like ten times, and the actual code is pretty simple. So, I have absolutely no idea where I am going wrong.

You can see my neural net on my github pages website if you follow the link below:
https://zachary-chiodini.github.io/neural-network/feedforwardneuralnetwork.html

The web page is very neat and written like a tutorial. However, it's not a very good tutorial, because the program doesn't work.

I just cannot for the life of me figure out why. If you are interested, you may peruse that web page and maybe tell me where I'm going wrong. I will appreciate it. If you are not interested, that is okay, too.

Thanks.
 
  • #2
Sadly, this is not a reasonable request. No one is going to slog through Python code to see why it doesn't work, especially neural net code, where you use training data to teach it and then test data to make sure it works.

There are just so many places where it could go wrong: swapped indices, wrong indices, a math function not returning what you expect.

My suggestion is to step through the code with a good Python IDE like PyCharm and see if you can figure out why it is misbehaving. It may take a really long time, but through persistent effort you will succeed or decide to become a manager so you can boss other people into finding the bug.

You can also start at the bottom and make sure each of your low-level methods and functions is working as expected. Maybe even create test cases that run them through several types of input to exercise each if branch and loop in the code.
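As a rough illustration of that bottom-up approach, here is a minimal sketch of a test for one low-level piece. The function names `sigmoid`, `sigmoid_prime` and `numerical_derivative` are hypothetical stand-ins, not the names used in the linked code; the idea is just to compare an analytic derivative against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Analytic derivative of the sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Compare the two derivatives on a handful of test points.
z = np.linspace(-5.0, 5.0, 11)
max_error = np.max(np.abs(sigmoid_prime(z) - numerical_derivative(sigmoid, z)))
print(max_error)  # should be tiny if the analytic derivative is right
```

If the printed error is not close to zero, the derivative code and the math disagree, and that is found without training anything.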

You know it's not working, which means you've seen something that tells you it's not working, so you are the best one to debug it. We don't have that knowledge.
 
  • #3
Zap said:
it is still not working

What does "not working" mean? By itself that statement is so vague that I don't see how you could expect anyone to have any useful input.
 
  • #4
Well, there's no obvious error. The program runs, but it doesn't do what it's supposed to. The training algorithm does not seem to be working correctly. After training, the network outputs complete nonsense.

I am comparing it with the MLP class in the Sklearn library. The Sklearn MLP can handle what I'm trying to do perfectly. However, my network outputs complete nonsense after training.

It also does weird things that should not be possible, like computing identical losses after instantiating and training the class multiple times, and outputting matrices that are composed entirely of a single duplicated vector or row.

The coding is based entirely off of the math. So, I'm guessing I did the math wrong, but I can't see where.
 
  • #5
PeterDonis said:
What does "not working" mean? By itself that statement is so vague that I don't see how you could expect anyone to have any useful input.
Zap said:
but it doesn't do what it's supposed to.
Zap said:
It also does weird things that should not be possible

Well, that certainly clears things up! o0)

@jedishrfu and @PeterDonis are right. "Debug my code for me" is not a reasonable request, especially if you can't give a clear, simple and reproducible way to generate the error. That should be your next step - figuring out the minimum code and the minimum data to reproduce the error. Without that, we don't have a hope of figuring it out. With that, maybe you can without any help.
 
  • #6
Zap said:
it doesn't do what it's supposed to

So what is it supposed to do? And what different thing is it actually doing? We can't read your mind and we can't see your computer screen by clairvoyance.

Zap said:
I am comparing it with the MLP class in the Sklearn library.

Comparing what? Expected outputs for some set of test inputs? What are they? How are we supposed to know any of this if you don't tell us the specifics?
 
  • #7
You're right. I'm a little burnt out on this thing, though. I'm not sure how to give a precise description of what is not working when I have no idea what's not working.

I was hoping to get a second opinion on the math that I'm using, but it would be tough to write it all out in its entirety. If someone is very familiar with feedforward neural nets, I was hoping they might be able to take a quick look at the math I'm using and verify whether it's correct or not.

There are only slight differences between my formulas and the formulas found on the wiki page for backpropagation under matrix multiplication. Where I am calling the gradient of the cost function ##C## with respect to the input layer ##z##, or ##\nabla_{z}C##, the wiki page is calling just delta, or ##\delta##.

The main differences that I'm seeing between my formulas and the wiki page's formulas are that the wiki page is using ##\nabla_{W}^{(l)}C=\delta^{(l)}(a^{(l-1)})^{T}## where I am using ##\nabla_{W}^{(l)}C=(a^{(l-1)})^{T}\delta^{(l)}##, and the wiki page is using ##\delta^{(l-1)}=(f^{(l-1)})^{'}\bullet(W^{(l)})^{T}\bullet\delta^{(l)}## where I am using ##\delta^{(l-1)}=(f^{(l-1)})^{'}\bullet\delta^{(l)}(W^{(l)})^{T}##.

I think the reason for the slight difference in the formulas is that the wiki page is doing a special case of row-wise backpropagation, where I am doing backpropagation on the entire data set at once.

In the wiki page, ##\delta^{(l)}## is a vector, which is most likely the error of a single example in the data set, but I am having ##\delta^{(l)}## be a matrix, which is the error term for all of the data in the set. Is it not possible to do backpropagation on all of the data at once? It's intuitive to me to do backpropagation on all of the data and make ##\delta^{(l)}## a matrix, because the loss function ##C## is a function of all of the data. In my model, the loss/cost function is ##C(\hat{Y};Y)##, where the model output ##\hat{Y}## and the target matrix ##Y## are both matrices. So, taking the gradient of the loss/cost function must also be a matrix, right? How could it be a vector? And what would be the point of separating the data into batches if the gradient is being computed for each row at a time?
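One way to test whether the matrix form and the per-example vector form agree is to compute both on small random data. This is only a sketch under stated assumptions: examples are stored as rows, the layer computes ##Z = A^{(l-1)}W + b##, and the cost is a plain sum over examples. The variable names (`A_prev`, `delta`, `W`) are illustrative and not taken from the linked code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only so the check is repeatable

m, n_prev, n_cur = 5, 4, 3             # examples, inputs to the layer, neurons in the layer
A_prev = rng.normal(size=(m, n_prev))  # one row of activations per example
delta = rng.normal(size=(m, n_cur))    # one row of error terms per example
W = rng.normal(size=(n_prev, n_cur))   # weights in the rows-as-examples convention

# Batch form: the gradient has the same shape as W, (n_prev, n_cur).
grad_W_batch = A_prev.T @ delta

# Per-example form (column-vector convention): a sum of outer products
# delta * a^T with shape (n_cur, n_prev), the transpose of the batch result.
grad_W_sum = np.zeros((n_cur, n_prev))
for a_row, d_row in zip(A_prev, delta):
    grad_W_sum += np.outer(d_row, a_row)

gradients_agree = np.allclose(grad_W_batch, grad_W_sum.T)
print(gradients_agree)

# Propagating delta backwards: the batch form delta @ W.T gives, row by row,
# the per-example product W @ delta_vector (before multiplying by f').
back_batch = delta @ W.T
back_rows = np.stack([W @ d_row for d_row in delta])
backprop_agrees = np.allclose(back_batch, back_rows)
print(backprop_agrees)
```

Under these assumptions the two conventions differ only by a transpose, so treating ##\delta^{(l)}## as a matrix is not by itself an error.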

It's also not a lot of code. I'm just saying if anyone is interested, they can look over it. If not, that's fine.
 
  • #8
PeterDonis said:
So what is it supposed to do? And what different thing is it actually doing? We can't read your mind and we can't see your computer screen by clairvoyance.
Comparing what? Expected outputs for some set of test inputs? What are they? How are we supposed to know any of this if you don't tell us the specifics?

I'm not sure if that's entirely relevant, but I do understand your point. The error is not from setting up the problem, or applying it to a specific problem. A neural net takes an input matrix and a target matrix, does a backpropagation gradient descent algorithm and outputs an output matrix that should approximate the target matrix.

The error is somewhere within the class itself. I think it may have something to do with how I developed the math model for the neural network, but it's still eluding me what exactly is wrong with the math. If the math is actually correct, then it must be some weird bug, maybe from using pointers. Who knows, but I want to make sure the math is correct. Perhaps I should have rephrased the OP.
 
  • #9
You might consider posting your code to e.g. http://pythonfiddle.com/, or another online Python 'fiddle' IDE, and working on it there. If you make it publicly visible (with notes for what is or isn't working), perhaps even someone who has worked the same tutorial might run across your code ...
 
  • #10
How might the loss function compute a total loss that is exactly the same every time you train the network? Shouldn't that be impossible, since the weights are initialized randomly? I am not specifying any seed for the random weights, and it would be unlikely that Numpy would seed the random variables exactly the same multiple times. This should not be possible, and is likely not due to an incorrect mathematical formula.

So, I'm starting to think that perhaps the math is correct, even though it is a little bit different than what is written on Wikipedia. This makes me think it's a coding error, but I only have like 80 lines of code here, and I'm not seeing it.
 
  • #11
Zap said:
In the wiki page, ##\delta^{(l)}## is a vector, which is most likely the error of a single example in the data set, but I am having ##\delta^{(l)}## be a matrix, which is the error term for all of the data in the set. Is it not possible to do backpropagation on all of the data at once?

You can do backpropagation on all the data at one time (called 'batch' gradient descent), in mini-batches (called 'mini-batch stochastic gradient descent') or one sample at a time (called 'stochastic gradient descent', terminology varies): https://ruder.io/optimizing-gradient-descent/
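The three variants differ only in how many rows go into each gradient step. A minimal sketch follows, with a hypothetical helper `iterate_minibatches` that is not part of the linked code:

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs that together cover the data once."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        rows = order[start:start + batch_size]
        yield X[rows], Y[rows]

rng = np.random.default_rng()
X = np.arange(20.0).reshape(10, 2)   # 10 toy examples, 2 features
Y = np.arange(10.0).reshape(10, 1)   # 10 toy targets

# batch_size = 10 -> batch gradient descent (one update per pass)
# batch_size = 4  -> mini-batch (three updates per pass: 4 + 4 + 2 rows)
# batch_size = 1  -> one example per update
batch_shapes = [xb.shape for xb, yb in iterate_minibatches(X, Y, 4, rng)]
print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

In each case the gradient for one update can still be computed with matrix operations over whatever rows are in the current batch.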
 
  • #12
I would test your NumPy assumption about random weights, or see how they are used in the calculation. Perhaps you're doing something to make them "predictable".
 
  • #13
I thought that doing the gradient over all of the data or batches of the data made more sense. The Wikipedia article was confusing me. I guess it tries to simplify things by talking about a single vector, but that just totally confused me.

With 100 neurons and 1 hidden layer, the output was the same no matter what the network weights were. The weights kept changing, but the output would still be the same, for some reason. That's why the loss was a constant over all random weights.

This is an output I am getting after training with 30 neurons and 1 hidden layer, which doesn't make sense. For some reason every output in the output matrix is exactly the same. It makes no sense, but I think I can say this is probably not due to a mathematical error.
Python:
>>>FFNN.forwardpropagation( X )
array([[1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       ...,
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14],
       [1.57180290e-20, 9.99999881e-01, 5.68244953e-23, ...,
        3.26637459e-14, 8.78497753e-74, 2.42167327e-14]])
 
  • #14
Write comments in your code: for an algorithm like this I would expect more lines of comment than there are lines of code.

Except in trivial loops where i, j and k are ok, make your indices look like indices. If you are frequently traversing over a fixed number of elements, set that number as a 'constant' at the beginning of the function: so instead of
Python:
for hidden_layer in range( len( self.network ) - 1, 0, -1 ) :
write something like
Python:
# Set some reusable loop bounds.
last_network_layer = len(self.network) - 1
...
# Traverse the hidden layers top down.
for hidden_layer_index in range( last_network_layer, 0, -1 ) :

Remember Python lists are indexed from 0 to len(myList) - 1.

Make use of infix operators where relevant, and be clear and consistent with your variable names (grad_b seems to be the derivative of the bias, but what is grad_w?): so instead of
Python:
self.network[ hidden_layer ] = self.network[ hidden_layer ] - r*grad_w
self.bias[ hidden_layer ] = self.bias[ hidden_layer ] - r*grad_b
write something like
Python:
# Reduce the hidden layer's network and bias values by their scaled derivatives.
self.network[ hidden_layer_index ] -= step_size * grad_network
self.bias[ hidden_layer_index ] -= step_size * grad_bias
 
  • #15
  • #16
He seems to be applying the gradient descent row-wise, or treating ##\delta## as a vector instead of a matrix, which I don't understand. He uses batches, yet splits each batch into vectors, so it doesn't make any sense to me.
 
  • #17
Zap said:
Summary:: I've no idea why it's not working ...

Hey, guys.

So, I've developed a basic multilayered, feedforward neural network from scratch in Python. However, I cannot for the life of me figure out why it is still not working. I've double checked the math like ten times, and the actual code is pretty simple. So, I have absolutely no idea where I am going wrong.

You can see my neural net on my github pages website if you follow the link below:
https://zachary-chiodini.github.io/neural-network/feedforwardneuralnetwork.html

The web page is very neat and written like a tutorial. However, it's not a very good tutorial, because the program doesn't work.

I just cannot for the life of me figure out why. If you are interested, you may peruse that web page and maybe tell me where I'm going wrong. I will appreciate it. If you are not interested, that is okay, too.

Thanks.

[Attached image: 1614155738548.png]

I recommend changing your color scheme.

One thing that I noticed is that your derivative code doesn't match your math.

[Attached image: 1614157035379.png]


Zap said:
How might the loss function compute a total loss that is exactly the same every time you train the network? Shouldn't that be impossible, since the weights are initialized randomly? I am not specifying any seed for the random weights, and it would be unlikely that Numpy would seed the random variables exactly the same multiple times.

If you're not seeding the random number generator, then I would expect it to be exactly the same weights every time. Seeding the random number generator (with a different number each time you run the program, e.g. with the time) is how you would get different ones.

You might have a bunch of simple low-level mistakes here and there. I recommend tracing through your code, checking that every line is doing what you expect it to do. If your output is not even changing, then you must have something wrong. Is it entering your loops? When it is supposed to be updating the outputs, what is happening?

Also, I don't know what your inputs and target outputs are.

No need to stress over a possible math error, when you probably have some simple coding errors anyway.
 
  • #18
Jarvis323 said:
I recommend changing your color scheme.
View it with Google's Dark Reader extension lol

My whole reason for doing this is (1) to understand a basic feed forward neural network entirely and (2) to make a multi-part instructional video to post on YouTube to spruce up my online portfolio and potentially make some money. So, I will be screen recording the tutorial while using Google Dark Reader extension, hence the strange color scheme. However, I can't seem to get my code to work! I was so close to finishing this ... I even have about 120 powerpoint slides painstakingly deriving all of the math. I put so much effort into this.

Jarvis323 said:
If you're not seeding the random number generator, then I would expect it to be exactly the same weights every time. Seeding the random number generator (with a different number each time you run the program, e.g. with the time) is how you would get different ones.
I believe Numpy automatically reseeds the random number generator, probably based on the current time, or something like that.
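The seeding question can be checked directly. As far as NumPy's documented behavior goes, a generator created without a seed draws fresh entropy from the operating system, so unseeded runs should differ. The sketch below uses `np.random.default_rng`; the linked code may well use the older `np.random` functions instead, which is an assumption here, though those are also seeded unpredictably by default.

```python
import numpy as np

# Two generators created without a seed: each pulls fresh entropy from the
# operating system, so their streams should differ.
unseeded_a = np.random.default_rng().normal(size=5)
unseeded_b = np.random.default_rng().normal(size=5)
print(np.array_equal(unseeded_a, unseeded_b))

# Two generators created with the same explicit seed: identical streams.
seeded_a = np.random.default_rng(42).normal(size=5)
seeded_b = np.random.default_rng(42).normal(size=5)
print(np.array_equal(seeded_a, seeded_b))
```

So identical losses across runs are more likely to come from the network's outputs not depending on the weights than from identical initial weights.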
 
  • #19
Jarvis323 said:
One thing that I noticed is that your derivative code doesn't match your math.
The math is showing the gradient applied to the ##m^{th}## example. The code is applying the gradient to the entire input matrix. At least, that's what I think it's doing.

Oh, you meant for the activation function. I believe I made a typo in the math. Good catch.

Jarvis323 said:
No need to stress over a possible math error, when you probably have some simple coding errors anyway.

I'm thinking the math is most likely correct, because when I start rearranging things, the matrix multiplications fail due to the operands' dimensions not matching up.
 
  • #20
:cry:
 
  • #21
I made the recommended changes and tried to make it a little more readable, but I still cannot find out where there is an error. I'm thinking it must be a mathematical error, because the code is very simple, and I don't see any error in the code.
 
  • #22
Try flow charting it.
The attention to detail required along with the parallel processing of your visual channel is surprisingly effective.
 
  • #23
Python:
class FeedForwardNeuralNetwork( FeedForwardNeuralNetwork  ) :
  
    def __forwardpropagation( self, X : Input_Matrix, A : Dict, Z : Dict ) -> Output_Matrix :
        ...
        for layer_n in range( 1, len( self.weights ) + 1 ) :
            Z[ layer_n ] = np.matmul( A[ layer_n - 1 ],  self.weights[ layer_n ]  ) + self.biases[ layer_n ]
            ...
It looks like you have some off by 1 errors. The loop goes over layer_n = 1 to len(self.weights). On the last iteration, when layer_n=len(weights), you are accessing self.weights[ len( self.weights ) ].

This should give an error, list index out of bounds, which makes me think that either len( self.weights )=0, so the loop isn't even entered, or __forwardpropagation is never actually called. Maybe the first loop condition in the train function, np.sqrt( totgrad ) > convergence, is never true?

After fixing your indexing errors, answer some questions.

Have you tried tracing the execution? What happens? Is it even going into the loops? At which point does something unexpected happen?

If you have to, input something simple, so you know what the results should look like in each operation.

Edit:

Also, you have the same off by 1 error in train, again trying to access weights[ len(weights) ], which tells us again that the code in that inner loop is never executed (because if it was, you would get an error and the program would halt).

Or maybe your program is halting with errors and you're not noticing it?

Also, why are weights[ 0 ] and biases[ 0 ] left undefined?
 
  • #24
Yea. It might be a little confusing, but I'm not using lists to store the weights and biases. I'm using two dictionaries with keys that are integer values. The keys are the layer numbers. The input layer is the zeroth layer and does not have any weights or biases. Hence why there is no weights[0] or biases[0].

I decided to use a dictionary, because it's easier to populate and repopulate. Also, the math I derived is treating the input layer as ##l = 0##, and the output layer as ##l = number\ of\ hidden\ layers + 1##. So, I wanted to follow that convention.

Also, when storing the weighted input of each layer ##Z## and the output of each layer ##A##, it is easier to follow mathematically if they are dictionaries with layer number as keys. That's why I decided not to use ##i## in the for loops.

So, there are no lists in this program lol, except for the keyword argument "hidden_layer."
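To make the difference concrete, here is a small sketch of the layout described above, with dictionaries keyed by layer numbers 1..L and no entry for the input layer. The layer sizes and variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 3, 2]  # hypothetical: 4 inputs, 3 hidden neurons, 2 outputs

# Keys are layer numbers 1..L; layer 0 (the input layer) has no entry.
weights = {
    layer_n: rng.normal(size=(layer_sizes[layer_n - 1], layer_sizes[layer_n]))
    for layer_n in range(1, len(layer_sizes))
}

visited = []
for layer_n in range(1, len(weights) + 1):
    # Valid for a dict keyed 1..L: the last key equals len(weights).
    visited.append((layer_n, weights[layer_n].shape))
print(visited)

# The same index on a list would fail on the last iteration.
weights_list = [weights[1], weights[2]]
try:
    weights_list[len(weights_list)]
except IndexError as error:
    print("list:", error)
```

With a dict the loop bounds are correct as written, which is why no index error appears, but a reader who assumes a list will see an off-by-one bug.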
 
  • #25
I figured out why the program outputs an array of identical vectors, as seen in post #13.

It is because the weighted input to hidden layer 1 is a matrix of large numbers, due to the dot product of a large number of features, so that the output of layer 1 is an array of nothing but 1s, because the sigmoid activation function returns approximately 1 for any number larger than like 5. The weighted input matrix to the first hidden layer has values of around 50. Then, when the next layer receives this array of 1s, it applies its weights and activation function and outputs an array of identical vectors, and so on.
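The saturation effect is easy to reproduce. The sketch below uses made-up stand-ins (raw pixel values in [0, 255] and uniform weights in [0, 1)), which are assumptions and not the actual data or initialization from the linked code. It also shows one common remedy, scaling the inputs and using small zero-mean weights, which is offered here only as a possibility and not as the fix used in the thread.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, n_hidden = 784, 30

# Hypothetical stand-ins: raw pixel intensities and uniform [0, 1) weights.
X_raw = rng.integers(0, 256, size=(4, n_features)).astype(float)
W_uniform = rng.random(size=(n_features, n_hidden))

# Every weighted input is a large positive number, so every activation saturates.
A_saturated = sigmoid(X_raw @ W_uniform)
print(A_saturated.min())

# Possible remedy (an assumption): inputs scaled to [0, 1] and small zero-mean
# weights keep the weighted inputs in the sigmoid's sloped region.
X_scaled = X_raw / 255.0
W_small = rng.normal(scale=1.0 / np.sqrt(n_features), size=(n_features, n_hidden))
A_spread = sigmoid(X_scaled @ W_small)
print(A_spread.min(), A_spread.max())
```

Once the first hidden layer outputs all 1s, every example looks identical to the later layers, which matches the identical rows seen in post #13. The sigmoid's derivative is also nearly zero there, so gradient descent barely moves those weights.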
 
  • #26
Zap said:
Yea. It might be a little confusing, but I'm not using lists to store the weights and biases. I'm using two dictionaries with keys that are integer values. The keys are the layer numbers. The input layer is the zeroth layer and does not have any weights or biases. Hence why there is no weights[0] or biases[0].

I decided to use a dictionary, because it's easier to populate and repopulate. Also, the math I derived is treating the input layer as ##l = 0##, and the output layer as ##l = number\ of\ hidden\ layers + 1##. So, I wanted to follow that convention.

Also, when storing the weighted input of each layer ##Z## and the output of each layer ##A##, it is easier to follow mathematically if they are dictionaries with layer number as keys. That's why I decided not to use ##i## in the for loops.

So, there are no lists in this program lol, except for the keyword argument "hidden_layer."
Ahh, I see. Yeah, that threw me off.
 
  • #27
Zap said:
I decided to use a dictionary, because it's easier to populate and repopulate.
I think that one of the things you should take away from this exercise is that that was not a good decision.
 
  • #28
pbuk said:
I think that one of the things you should take away from this exercise is that that was not a good decision.
Why though?

I'm almost positive it's a mathematical error, now. I suppose using dicts with integer keys may be a little confusing, but it follows the math notation, so I'll prob stick to it until the program actually starts working. Then, maybe I'll go in and change the dicts to lists. Maybe I'll include an input layer into the lists of weights and biases, so it will still follow the math notation.

For now, I have to compare the math that I derived and used with other sources. Hopefully, I find another source that is also treating ##\delta## as a matrix, instead of a vector, or at least one that explains why ##\delta## should be a vector and not a matrix.

The strange thing is that the program does minimize the cost function, or at least seems to minimize it. Somehow, it achieves a model with a smaller cost than what the SkLearn module produces. However, the output is complete nonsense. It's just a matrix of identical vectors, and every vector is composed of nothing but zeros. Somehow, this is a better model, according to the sum of squared residuals.

I am applying the model to the MNIST training set, which is composed of 60,000 training examples. My erroneous model achieves a cost of 59,999, which makes sense if it outputs a matrix of nothing but zeros. I don't think it's an issue of overfitting, because if I decrease the training time or number of iterations, the cost goes up and it still outputs a matrix of identical vectors. The SkLearn models' costs are typically around 70,000-100,000, but those are actual working models.
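The cost figure can be sanity-checked with a short sketch. It assumes one-hot targets and a plain sum of squared residuals with no factor of 1/2, neither of which is confirmed by the thread, and it uses random labels as a stand-in for the real MNIST labels.

```python
import numpy as np

n_examples, n_classes = 60_000, 10
rng = np.random.default_rng(0)

# Hypothetical one-hot targets standing in for the MNIST labels.
labels = rng.integers(0, n_classes, size=n_examples)
Y = np.zeros((n_examples, n_classes))
Y[np.arange(n_examples), labels] = 1.0

Y_hat_zeros = np.zeros_like(Y)                # a model that always outputs zeros
cost_zeros = np.sum((Y_hat_zeros - Y) ** 2)   # sum of squared residuals
print(cost_zeros)                             # 60000.0, one unit of cost per example

# Accuracy exposes what the cost hides: argmax of an all-zero row is always class 0.
accuracy_zeros = np.mean(np.argmax(Y_hat_zeros, axis=1) == labels)
print(accuracy_zeros)
```

Under those assumptions an all-zero output costs exactly one unit per example, which is in line with the reported 59,999. Comparing classification accuracy as well as cost would separate a working model from a degenerate one.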

It's kind of funny that a nonsense model that outputs nothing but zeros is a better model according to the cost function lol. I wish this were just an issue of overtraining, but it's not. I'm convinced the math is wrong.
 
  • #29
I'm actually reading that using integer values as keys for a dictionary improves performance. Why is this a bad idea in this context? When using a list, the code becomes less readable, in my opinion, due to the way lists are indexed and that each element in the list must be contiguous. This puts constraints on how you can represent the network, and forces you to represent it in such a way that is not at all intuitive, mathematically.

For example, the weighted input to the zeroth layer (input layer) does not exist. So what would you put at index 0 for the list of weighted inputs to each layer? It would also be strange to include an input layer in the list of weights, since the input layer does not have any weights. So, to avoid having a different index for the weights, the weighted inputs and the outputs of each layer, and have none of them follow the mathematical notation, I decided to use a dictionary.
 
  • #30
Zap said:
performance

Why are you worried about performance before you have gotten the right answer? If a fast but wrong answer is acceptable, I can show you some very fast algorithms.
 
  • #31
I'm just asking why I should not be using a dictionary to store weights and biases for the neural network.

So, I am confident I am dealing with a mathematical issue, which stems from my definition of the gradient. I'm still not exactly sure why it's wrong. But, I think that is where the issue is coming from.
 
  • #32
In all internet sources, the gradient is being treated as a vector, where I am treating it as a matrix.

I treat it as a matrix, because the cost function is a function of a matrix. It is also a function of the activation values of the output layer for every example in the input matrix, not just for a single vector example.

However, for some reason, the internet is telling me that the gradient should be a vector, and should be computed only for a single vector example at a time.

That's why my equations look different from what is found in internet resources, and I'm assuming that that is why my neural net is not working. But why can't the gradient be computed as a matrix, considering all of the vectors in the input matrix, of which the cost function is a function? And what is the point of batches if you can only apply the gradient per input vector?

What is the reason for this restriction? Why can't the gradient be a matrix, as I've defined in the link in the OP?
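Whether a matrix-valued gradient formula is right can be tested numerically, independent of any reference. The sketch below does this for a single sigmoid layer with a half-sum-of-squares cost. The layer, the cost and all names are assumptions chosen for illustration, not the model from the link in the OP.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, X, Y):
    """Half the sum of squared residuals for a single sigmoid layer."""
    return 0.5 * np.sum((sigmoid(X @ W + b) - Y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # 6 examples as rows, 4 features
Y = rng.random(size=(6, 3))      # 6 target rows, 3 outputs
W = rng.normal(size=(4, 3))
b = rng.normal(size=(1, 3))

# Analytic gradient with delta as a matrix (one row per example).
A = sigmoid(X @ W + b)
delta = (A - Y) * A * (1.0 - A)
grad_W = X.T @ delta

# Numerical gradient: nudge one weight at a time and difference the cost.
h = 1e-6
grad_W_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = h
        grad_W_numeric[i, j] = (cost(W + step, b, X, Y) - cost(W - step, b, X, Y)) / (2 * h)

max_difference = np.max(np.abs(grad_W - grad_W_numeric))
print(max_difference)  # should be close to zero if the matrix formula is right
```

The same check applied layer by layer to the real network would show whether the derived formulas or the code implementing them is at fault.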
 
  • #33
Zap said:
I'm just asking why I should not be using a dictionary to store weights and biases for the neural network.
  1. Because the data you want to represent has all of the characteristics of a list and none of the characteristics of a dictionary.
  2. Because at least 2 people reading this thread have assumed you were using a list; when you return to this code in 12 months time so will you.
  3. Because for..in on a dictionary is not guaranteed to traverse the keys in numerical order.
  4. ...
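Point 3 can be seen in a few lines. Since Python 3.7 a dict iterates in insertion order, which matches numerical order only if the keys happen to be inserted that way. The names below are hypothetical.

```python
# Dictionaries iterate in insertion order (guaranteed since Python 3.7),
# which is not necessarily the numerical order of the keys.
weights_by_layer = {}
weights_by_layer[2] = "W2"   # hypothetical: layer 2 happens to be filled in first
weights_by_layer[1] = "W1"

insertion_order = list(weights_by_layer)
print(insertion_order)   # [2, 1]

# Explicit numerical order needs sorted() or a range over the layer numbers.
numeric_order = sorted(weights_by_layer)
print(numeric_order)     # [1, 2]
```

Looping with `range(1, len(weights) + 1)`, as the code in post #23 does, sidesteps this, but a plain `for key in weights` would not.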

Zap said:
I'm actually reading that using integer values as keys for a dictionary improves performance.
Against using non-integer values maybe, because the hashing function can be cheaper, but a list doesn't need a hashing function so is always faster.

An even better solution would be a numpy.array.
  1. It is the 'standard tool' for writing code like this.
  2. It gives you concise access to code implementing mathematical vector operations e.g. overloaded multiplication operators: your code will look a lot more like the maths.
  3. Because list traversals are implemented in C they will always be quicker than userland code so you will not have to refactor for optimum performance.
  4. ...
 
  • #34
I think you should have made it clearer that you are not attempting to implement the backpropagation algorithm on the wikipedia page, you are trying to implement an algorithm you have invented yourself. This algorithm will never work as it is based on a fundamental misunderstanding:

Zap said:
In the wiki page, ##\delta^{(l)}## is a vector, which is most likely the error of a single example in the data set

It is not. It is the difference between the current values of the coefficients in layer ## l ## and their unknown optimum values.

Here is my recommendation:
  1. study the subject so that you understand how it works; I would never recommend Wikipedia for this, in this case neuralnetworksanddeeplearning.com is a great free resource (which looks to have been quoted wholesale in Wikipedia, but why rely on the accuracy of that quote when you can go to the original?)
  2. study how scikit-learn does it https://github.com/scikit-learn/sci...earn/neural_network/_multilayer_perceptron.py
  3. implement the standard algorithm (or some other proven algorithm) using Python structures designed for the purpose (with the layers represented by a list of numpy.arrays)
Once you have done this you can start designing and implementing your own 'improved' supervised learning algorithm.
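For step 3, the layout could look something like the sketch below: a Python list holding one NumPy array per non-input layer. A single array would not work because the weight matrices have different shapes. The layer sizes, the initialization scale and the `forward` helper are illustrative assumptions, and this is not how scikit-learn's code is organized in detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3]   # hypothetical: 4 inputs, 5 hidden neurons, 3 outputs

# weights[k] maps layer k to layer k + 1, so there is one entry per non-input layer.
weights = [rng.normal(scale=0.5, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((1, n_out)) for n_out in layer_sizes[1:]]

def forward(X, weights, biases):
    """Return the activations of every layer, input included, for a batch of rows."""
    activations = [X]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations

X = rng.normal(size=(2, 4))   # 2 examples as rows
activations = forward(X, weights, biases)
print([a.shape for a in activations])  # [(2, 4), (2, 5), (2, 3)]
```

Iterating with `zip` avoids explicit layer indices altogether, so there is no index 0 to leave undefined.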
 
  • #35
pbuk said:
  1. Because the data you want to represent has all of the characteristics of a list and none of the characteristics of a dictionary.
  2. Because at least 2 people reading this thread have assumed you were using a list; when you return to this code in 12 months time so will you.
  3. Because for..in on a dictionary is not guaranteed to traverse the keys in numerical order.
  4. ...
Against using non-integer values maybe, because the hashing function can be cheaper, but a list doesn't need a hashing function so is always faster.

An even better solution would be a numpy.array.
  1. It is the 'standard tool' for writing code like this.
  2. It gives you concise access to code implementing mathematical vector operations e.g. overloaded multiplication operators: your code will look a lot more like the maths.
  3. Because list traversals are implemented in C they will always be quicker than userland code so you will not have to refactor for optimum performance.
  4. ...
Understood. When using Numpy, I'd have to create an array of lists, where each list contains an array of weights, so that each element in the outermost array has the same dimensions.
 
