Why is my multilayered, feedforward neural network not working?

Zap · Feb 26, 2021

I'm just asking why I should not be using a dictionary to store weights and biases for the neural network.

So, I am confident I am dealing with a mathematical issue, which stems from my definition of the gradient. I'm still not exactly sure why it's wrong. But, I think that is where the issue is coming from.

Zap · Feb 27, 2021

In all internet sources, the gradient is being treated as a vector, where I am treating it as a matrix.

I treat it as a matrix, because the cost function is a function of a matrix. It is also a function of the activation values of the output layer for every example in the input matrix, not just for a single vector example.

However, for some reason, the internet is telling me that the gradient should be a vector, and should be computed only for a single vector example at a time.

That's why my equations look different from what is found in internet resources, and I'm assuming that is why my neural net is not working. But why can't the gradient be computed as a matrix, considering all of the vectors in the input matrix that the cost function depends on? And what is the point of batches if you can only apply the gradient per input vector?

What is the reason for this restriction? Why can't the gradient be a matrix, as I've defined in the link in the OP?

pbuk · Feb 27, 2021

Zap said:

I'm just asking why I should not be using a dictionary to store weights and biases for the neural network.

Because the data you want to represent has all of the characteristics of a list and none of the characteristics of a dictionary.
Because at least 2 people reading this thread have assumed you were using a list; when you return to this code in 12 months' time so will you.
Because for..in on a dictionary is not guaranteed to traverse the keys in numerical order.
...

Zap said:

I'm actually reading that using integer values as keys for a dictionary improves performance.

Against using non-integer values maybe, because the hashing function can be cheaper, but a list doesn't need a hashing function so is always faster.

An even better solution would be a numpy.array.

It is the 'standard tool' for writing code like this.
It gives you concise access to code implementing mathematical vector operations e.g. overloaded multiplication operators: your code will look a lot more like the maths.
Because list traversals are implemented in C they will always be quicker than userland code so you will not have to refactor for optimum performance.
...
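The list-of-arrays layout suggested above can be sketched as follows. This is only an illustration under assumed details: the layer sizes, the sigmoid activation, and the names `layer_sizes`, `weights`, `biases` and `feedforward` are hypothetical and do not come from the code discussed in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 inputs, 4 hidden units, 2 outputs.
layer_sizes = [3, 4, 2]

# One weight matrix and one bias vector per layer, kept in plain lists.
# The arrays have different shapes, which a list handles without trouble.
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feedforward(x, weights, biases):
    """Propagate a column vector x through every layer in order."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a


x = rng.standard_normal((3, 1))
output = feedforward(x, weights, biases)
```

Iterating with `zip(weights, biases)` visits the layers in order, so no integer keys are needed, and each matrix product reads much like the corresponding line of maths.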

pbuk · Feb 27, 2021

I think you should have made it clearer that you are not attempting to implement the backpropagation algorithm on the wikipedia page, you are trying to implement an algorithm you have invented yourself. This algorithm will never work as it is based on a fundamental misunderstanding:

Zap said:

In the wiki page, ##\delta^{(l)}## is a vector, which is most likely the error of a single example in the data set

It is not. It is the difference between the current values of the coefficients in layer ## l ## and their unknown optimum values.

Here is my recommendation:

study the subject so that you understand how it works; I would never recommend Wikipedia for this, in this case neuralnetworksanddeeplearning.com is a great free resource (which looks to have been quoted wholesale in Wikipedia, but why rely on the accuracy of that quote when you can go to the original?)
study how scikit-learn does it https://github.com/scikit-learn/sci...earn/neural_network/_multilayer_perceptron.py
implement the standard algorithm (or some other proven algorithm) using Python structures designed for the purpose (with the layers represented by a list of numpy.arrays)

Once you have done this you can start designing and implementing your own 'improved' supervised learning algorithm.

Zap · Feb 27, 2021

pbuk said:

Because the data you want to represent has all of the characteristics of a list and none of the characteristics of a dictionary.

Because at least 2 people reading this thread have assumed you were using a list; when you return to this code in 12 months' time so will you.

Because for..in on a dictionary is not guaranteed to traverse the keys in numerical order.

...

Against using non-integer values maybe, because the hashing function can be cheaper, but a list doesn't need a hashing function so is always faster.

An even better solution would be a numpy.array.

It is the 'standard tool' for writing code like this.

It gives you concise access to code implementing mathematical vector operations e.g. overloaded multiplication operators: your code will look a lot more like the maths.

Because list traversals are implemented in C they will always be quicker than userland code so you will not have to refactor for optimum performance.

...

Understood. When using Numpy, I'd have to create an array of lists, where each list contains an array of weights, so that each element in the outermost array has the same dimensions.

Zap · Feb 27, 2021

pbuk said:

I think you should have made it clearer that you are not attempting to implement the backpropagation algorithm on the wikipedia page, you are trying to implement an algorithm you have invented yourself. This algorithm will never work as it is based on a fundamental misunderstanding:
It is not. It is the difference between the current values of the coefficients in layer ## l ## and their unknown optimum values.

Here is my recommendation:

study the subject so that you understand how it works; I would never recommend Wikipedia for this, in this case neuralnetworksanddeeplearning.com is a great free resource (which looks to have been quoted wholesale in Wikipedia, but why rely on the accuracy of that quote when you can go to the original?)

study how scikit-learn does it https://github.com/scikit-learn/sci...earn/neural_network/_multilayer_perceptron.py

implement the standard algorithm (or some other proven algorithm) using Python structures designed for the purpose (with the layers represented by a list of numpy.arrays)

Once you have done this you can start designing and implementing your own 'improved' supervised learning algorithm.

I'm trying to implement the backpropagation algorithm; I've just derived that algorithm myself instead of copying some stuff from a website and plugging and chugging.

When I take the derivative of the cost function with respect to one of the weights in the output layer, I get a gradient ##\nabla_{a}C## that is a matrix after applying the general chain rule to the cost function. For some reason, the algorithm found at http://neuralnetworksanddeeplearning.com and on Wikipedia treats the same gradient ##\nabla_{a}C## as a vector. This doesn't make any sense to me, unless they are applying the cost function to one example at a time. But why are they doing that? The cost function is a function of all examples in the output matrix.

It will take some extra effort on my part, but I can show you how I derive ##\nabla_{a}C## as a matrix, simply by applying the chain rule when trying to take the derivative of the cost ##C## with respect to one of the weights in the output layer. ##\nabla_{a}C## is a vector only if you are treating the cost function ##C## as being a function of only one output vector. I could make a separate thread about this, because I believe I have now identified this as being the issue in my program. However, I don't understand why it's an issue, since ##C## is a function of the output matrix, not only an output vector.

Jarvis323 · Feb 27, 2021

As someone already mentioned, what you're trying to do is implement batch training. If you google that you should find the info you need pretty easily.

When you train in batches, you accumulate the gradients from each instance before applying the weight update.

Zap · Feb 27, 2021

So, if ##\nabla_{a}C## can indeed be a matrix, then my math is probably not wrong. Also, the deeplearning website claims to be doing batch training, but it still treats ##\nabla_{a}C## as a vector.

pbuk · Feb 27, 2021

Zap said:

This doesn't make any sense to me, unless they are applying the cost function to one example at a time.

That is exactly what 'they' are doing. Quoting from Michael Nielsen's online book near here:

As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, ## C=C_x##. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

Zap said:

But why are they doing that? The cost function is a function of all examples in the output matrix.

Because that is how this algorithm works: it calculates the gradient of the cost function for a single example. We average over all the ## m ## training examples ## x ## (in the mini-batch) in the gradient descent step:

$$ \begin{align} w^l & \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T \\
b^l & \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l} \end{align} $$
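The sum over training examples in the update rule can also be carried out as a single matrix product when the per-example column vectors are stacked side by side. The sketch below checks this numerically; the shapes, the random values, and the names `Delta` and `A_prev` are assumptions made for illustration and are not taken from Nielsen's reference code.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 5          # mini-batch size (assumed)
n_prev = 4     # units in layer l-1 (assumed)
n_curr = 3     # units in layer l (assumed)

# Column x of each matrix holds the vector for training example x.
Delta = rng.standard_normal((n_curr, m))   # errors delta^{x,l}
A_prev = rng.standard_normal((n_prev, m))  # activations a^{x,l-1}

# Per-example version: add up one outer product per training example.
grad_w_loop = np.zeros((n_curr, n_prev))
for x in range(m):
    delta_x = Delta[:, [x]]    # shape (n_curr, 1)
    a_x = A_prev[:, [x]]       # shape (n_prev, 1)
    grad_w_loop += delta_x @ a_x.T
grad_w_loop /= m

# Batched version: one matrix product performs the same sum.
grad_w_batch = (Delta @ A_prev.T) / m

# Bias gradient: average the error columns.
grad_b_batch = Delta.mean(axis=1, keepdims=True)

same = np.allclose(grad_w_loop, grad_w_batch)
```

Whatever the batch size `m`, the weight gradient keeps the shape of ##w^l## and the bias gradient keeps the shape of ##b^l##; the example dimension is summed away.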

Quoting again from Nielsen:

Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error ##\delta^l## and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.

Again my recommendation to you: study Nielsen's maths, algorithm and his reference Python implementation. Once you have a thorough understanding you will be in a position to judge whether calculating ## \nabla_{x,\delta}C ## instead of ## \frac\eta m \Sigma_{x} \nabla_{\delta} C_x ## is a good idea.

Zap · Feb 27, 2021

pbuk said:

Because that is how this algorithm works

Well, the algorithm claims to be calculating the gradient below, where ##\omega^{(ℓ)}_{j,k}## is the weight of the ##k^{th}## perceptron in the output layer ##ℓ## that is associated with the output of the ##j^{th}## perceptron in the hidden layer ##ℓ-1##.

$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\frac{\partial C }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}
\end{bmatrix}
$$

If you simply solve each of those derivatives in the matrix ##\nabla^{(ℓ)}_{\Omega}C##, you will get a matrix for ##\nabla_{a}C##, not a vector or an average of vectors. I've derived how that is done in a separate thread here.

It's not clear to me why this is the wrong approach, or why ##\nabla_{a}C## must be a vector in order for backpropagation to work. Because if we are only interested in the partial derivatives and applying the chain rule to calculate these derivatives, a matrix ##\nabla_{a}C## inevitably results.

So, what is going on here? Why can't we simply calculate the derivatives and apply the chain rule? Why must it be done one example at a time so that ##\nabla_{a}C## can be a vector? I know that's how the algorithm works, but why doesn't it work with a matrix ##\nabla_{a}C##? Why do those explaining the algorithm claim to be interested in calculating the derivatives of and applying the chain rule to the cost function ##C(Y;\hat{Y})##, but end up using some alternative vector implementation of the cost function ##C(y;\hat{y})##, instead?

It seems to me that something is being left out here. We aren't only interested in calculating the derivatives of the cost function with respect to the weights. If that were the case, ##\nabla_{a}C## would be a matrix. Something else is going on here.

Sorry if I'm repeating myself. I'm just trying to get my point across, and it's not exactly the simplest thing to convey in words.

I'm not really understanding Nielsen's approach. I can only understand my approach, which is just to calculate all the elements in ##\nabla^{(ℓ)}_{\Omega}C##, where ##C## is the actual cost function ##C(Y;\hat{Y})##.

Nielsen says

... what backpropagation actually lets us do is compute the partial derivatives ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for a single training example. We then recover ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}## by averaging over training examples.

but why not just compute ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}##, instead of "recovering" them by averaging ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for multiple examples? Nielsen does not seem to explain this.
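For reference, the two routes can be related in one line, assuming the total cost is defined as the average of the per-example costs over the ##m## examples (the form used in the mini-batch update quoted earlier in the thread). Differentiation is linear, so the derivative of the average is the average of the derivatives:

$$ C = \frac{1}{m}\sum_x C_x \quad\Longrightarrow\quad \frac{\partial C}{\partial \omega} = \frac{1}{m}\sum_x \frac{\partial C_x}{\partial \omega}, \qquad \frac{\partial C}{\partial b} = \frac{1}{m}\sum_x \frac{\partial C_x}{\partial b} $$

Under that assumption, computing ##\frac{\partial C}{\partial \omega}## directly and averaging the ##\frac{\partial C_{x}}{\partial \omega}## give the same numbers, and in both cases there is exactly one number per weight.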

Jarvis323 · Feb 28, 2021

Zap said:

Well, the algorithm claims to be calculating the gradient below, where ##\omega^{(ℓ)}_{j,k}## is the weight of the ##k^{th}## perceptron in the output layer ##ℓ## that is associated with the output of the ##j^{th}## perceptron in the hidden layer ##ℓ-1##.

$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\frac{\partial C }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}
\end{bmatrix}
$$

If you simply solve each of those derivatives in the matrix ##\nabla^{(ℓ)}_{\Omega}C##, you will get a matrix for ##\nabla_{a}C##, not a vector or an average of vectors. I've derived how that is done in a separate thread here.

It's not clear to me why this is the wrong approach, or why ##\nabla_{a}C## must be a vector in order for backpropagation to work. Because if we are only interested in the partial derivatives and applying the chain rule to calculate these derivatives, a matrix ##\nabla_{a}C## inevitably results.

So, what is going on here? Why can't we simply calculate the derivatives and apply the chain rule? Why must it be done one example at a time so that ##\nabla_{a}C## can be a vector? I know that's how the algorithm works, but why doesn't it work with a matrix ##\nabla_{a}C##? Why do those explaining the algorithm claim to be interested in calculating the derivatives of and applying the chain rule to the cost function ##C(Y;\hat{Y})##, but end up using some alternative vector implementation of the cost function ##C(y;\hat{y})##, instead?

It seems to me that something is being left out here. We aren't only interested in calculating the derivatives of the cost function with respect to the weights. If that were the case, ##\nabla_{a}C## would be a matrix. Something else is going on here.

Sorry if I'm repeating myself. I'm just trying to get my point across, and it's not exactly the simplest thing to convey in words.

I'm not really understanding Nielsen's approach. I can only understand my approach, which is just to calculate all the elements in ##\nabla^{(ℓ)}_{\Omega}C##, where ##C## is the actual cost function ##C(Y;\hat{Y})##.

Nielsen says

but why not just compute ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}##, instead of "recovering" them by averaging ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for multiple examples? Nielsen does not seem to explain this.

I think you might be getting confused by focusing on the linear algebra before understanding the goal, the topology, and the function that you're trying to learn.

The point of the neural network is not to learn a function that maps a set of elements to a set of targets. The point is to learn a function that maps an instance to its output and that generalizes, so that the same function works for new, unseen instances. In other words, it's modeling ##f(\mathbf{x})##, not ##f(\mathbf{X})##.

Think about the topology of the network and the function it computes. It doesn't have multiple examples and multiple targets. It's a function that maps a single instance to a single output. You don't have separate edges for different instance outputs. So how could you back propagate the errors of all instances at once (through the same edges) without accumulating them?

The implementation might support applying that function to a bunch of examples in one call, but it's the same function being applied to each example independently.
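A small numerical check of that last point, as a sketch only: the single `tanh` layer, its sizes, and the names `W`, `b` and `f` are hypothetical and chosen just to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical single layer: 3 inputs, 2 outputs.
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))


def f(x):
    """The network function for one instance (or a batch of columns)."""
    return np.tanh(W @ x + b)


# Five instances stacked as the columns of X.
X = rng.standard_normal((3, 5))

# One call on the whole batch ...
batched = f(X)

# ... matches calling f on each column separately.
one_by_one = np.hstack([f(X[:, [i]]) for i in range(X.shape[1])])

same = np.allclose(batched, one_by_one)
```

The batched call is a convenience of the array library: the same `W` and `b` act on every column, and no column influences the output of another.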

Zap · Feb 28, 2021

Jarvis323 said:

I think you might be getting confused by focusing on the linear algebra before understanding the goal, the topology, and the function that you're trying to learn.

The point of the neural network is not to learn a function that maps a set of elements to a set of targets. The point is to learn a function that maps an instance to its output, that generalizes so that the same function works for new unseen instances.

Think about the topology of the network and the function it computes. It doesn't have multiple examples and multiple targets. It's a function that maps a single instance to a single output. You don't have separate edges for different instance outputs. So how could you back propagate the errors of all instances at once (through the same edges) without accumulating them?

The implementation might support applying that function to a bunch of examples in one call, but it's the same function being applied to each example independently.

Hmm ... I think we are really pinpointing why my neural network does not work right now. It was neither a coding nor a mathematical error. It was more like a misconception of what a neural network is ... That's actually pretty amazing. I'm not entirely understanding what you're saying here. I think I'll need some time to think about it further. Thank you for working this out with me.

Zap · Feb 28, 2021

I think what you've posted needs to be explicitly stated, because it's not obvious why ##\nabla_{a}C## needs to be a vector if we are treating the algorithm as a math problem. We've worked out here that it's not exactly a straightforward math problem. There's something conceptual here that is not found by simply solving for the gradient of the cost function with respect to the weights of the neural network, and is not obviously implied by simply stating that ##\nabla_{a}C## must be a vector.

Jarvis323 · Feb 28, 2021

Maybe check out this video? See if it helps conceptually?

Zap · Feb 28, 2021

It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking. For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals. It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.

Jarvis323 · Feb 28, 2021

Zap said:

It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking. For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals. It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.

I guess it's just a naming convention.

Loss function is usually a function defined on a data point, prediction and label, and measures the penalty.
...
Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization).
...
Long story short, I would say that:

A loss function is a part of a cost function which is a type of an objective function.

https://stats.stackexchange.com/que...unction-loss-function-are-they-the-same-thing
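The distinction in the quoted answer can be sketched in a few lines. This is an illustration under assumptions: the quadratic loss, the L2 penalty, the toy numbers, and the names `squared_loss`, `cost` and `lam` are hypothetical choices, not definitions taken from the linked answer.

```python
import numpy as np


def squared_loss(y_hat, y):
    """Loss for one data point: penalty for a single prediction."""
    return 0.5 * np.sum((y_hat - y) ** 2)


def cost(Y_hat, Y, weights, lam=0.0):
    """Cost over a data set: mean of per-example losses plus an L2 penalty."""
    m = Y.shape[1]
    data_term = sum(squared_loss(Y_hat[:, i], Y[:, i]) for i in range(m)) / m
    penalty = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    return data_term + penalty


# Toy data: three examples stored as columns.
Y = np.array([[1.0, 0.0, 1.0]])
Y_hat = np.array([[0.5, 0.0, 2.0]])
weights = [np.array([[1.0, 2.0]])]

plain_cost = cost(Y_hat, Y, weights)             # no regularization
regularized = cost(Y_hat, Y, weights, lam=0.1)   # adds 0.5 * 0.1 * (1 + 4)
```

Replacing the mean with a plain sum, as in a residual sum of squares, only rescales the data term by the number of examples ##m##; the cost is still built from one loss term per example.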

pbuk · Mar 1, 2021

Zap said:

I think what you've posted needs to be explicitly stated, because it's not obvious why ##\nabla_{a}C## needs to be a vector.

What @Jarvis323 posted is obvious to me because what we are trying to do is change each of the coefficients so that the network performs better. For each coefficient we therefore need a single number to adjust it by, not a column of a matrix.
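The "single number per coefficient" point can be checked numerically. The sketch below assumes a hypothetical linear layer with an averaged quadratic cost (not the network from this thread) and compares the averaged per-example formula against finite differences taken one weight at a time.

```python
import numpy as np

rng = np.random.default_rng(3)

W = rng.standard_normal((2, 3))     # hypothetical 3-input, 2-output linear layer
X = rng.standard_normal((3, 6))     # six training inputs as columns
Y = rng.standard_normal((2, 6))     # matching targets


def cost(W):
    """Average quadratic cost over all columns of X."""
    residual = W @ X - Y
    return 0.5 * np.sum(residual ** 2) / X.shape[1]


# Analytic gradient: average of per-example outer products.
grad_analytic = (W @ X - Y) @ X.T / X.shape[1]

# Numerical gradient: nudge each weight on its own (central differences).
eps = 1e-6
grad_numeric = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        step = np.zeros_like(W)
        step[j, k] = eps
        grad_numeric[j, k] = (cost(W + step) - cost(W - step)) / (2 * eps)

same = np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```

Both gradients have the shape of `W`, however many training examples are stored in `X`.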

Zap said:

It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking.

According to whom?

Zap said:

For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals.

What do you think the word 'mean' signifies in 'mean squared error'?

Zap said:

It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.

Look, you've misunderstood something. There's nothing wrong with that, this is a complicated topic and to avoid what Nielsen calls 'index hell' he takes a few short cuts with the notation and terminology. What is wrong is complaining that everyone else is wrong for not explaining it so you understood it first time round. That is not a mature attitude to learning.

Zap · Mar 1, 2021

pbuk said:

What do you think the word 'mean' signifies in 'mean squared error'?

I was talking about the residual sum of squares, not the mean squared error.

pbuk said:

Look, you've misunderstood something. There's nothing wrong with that, this is a complicated topic and to avoid what Nielsen calls 'index hell' he takes a few short cuts with the notation and terminology. What is wrong is complaining that everyone else is wrong for not explaining it so you understood it first time round. That is not a mature attitude to learning.

I'm not sure where you're coming from with this. I don't think I've complained or done anything wrong, here. It was never explained why the gradient had to be a vector until post #41. Not that I require anyone to respond with an explanation, but I will continue asking questions when responded to as much as I like. If I can't ask questions, then there's no point in being on this forum, and the mods should promptly ban me.

A loss or cost function in regression analysis is not the same as the loss/cost function that is described here. They are not the same, because the approach that I've taken here will work for something like a linear regression. I think that someone could easily make that mistake. That's my opinion. It's not me having an attitude. I think it's disrespectful of you and it's irrelevant to the thread to post something like that. It contributes absolutely nothing and attempts to attack my character.

fresh_42 · Mar 1, 2021

Zap said:

I think it's disrespectful of you and it's irrelevant to the thread to post something like that. It contributes absolutely nothing and attempts to attack my character.

I don't think it should be read this way. It is generally difficult to communicate non-visually. People tend to see only their own perspective and disregard the fact that such a perspective might not necessarily be shared by others. If you feel misunderstood, then try to explain yourself as patiently as others try to help you. We are at post #48 and maybe people are running out of ideas about where the problem can actually be located.

@all: Please try to remain on topic.

Zap · Mar 1, 2021

The problem was resolved at post #41. I do appreciate everyone who helped out and were not condescending.

fresh_42 · Mar 1, 2021

Zap said:

The problem was resolved at post #41. I do appreciate everyone who helped out and were not condescending.

O.k., then let's move forward. Thread closed.
