# Neural Network Not Working

• Zap
I think you should have made it clearer that you are not attempting to implement the backpropagation algorithm on the wikipedia page, you are trying to implement an algorithm you have invented yourself. This algorithm will never work as it is based on a fundamental misunderstanding:

It is not. It is the difference between the current values of the coefficients in layer ## l ## and their unknown optimum values.

Here is my recommendation:
1. study the subject so that you understand how it works; I would never recommend Wikipedia for this. In this case neuralnetworksanddeeplearning.com is a great free resource (which looks to have been quoted wholesale in Wikipedia, but why rely on the accuracy of that quote when you can go to the original?)
2. study how scikit-learn does it https://github.com/scikit-learn/sci...earn/neural_network/_multilayer_perceptron.py
3. implement the standard algorithm (or some other proven algorithm) using Python structures designed for the purpose (with the layers represented by a list of numpy.arrays)
Once you have done this you can start designing and implementing your own 'improved' supervised learning algorithm.
I'm trying to implement the backpropagation algorithm; I've just derived the algorithm myself instead of copying some stuff from a website and plugging and chugging it.

When I take the derivative of the cost function with respect to one of the weights in the output layer, I get a gradient ##\nabla_{a}C## that is a matrix after applying the general chain rule to the cost function. For some reason, the algorithm found at http://neuralnetworksanddeeplearning.com and on Wikipedia treats the same gradient ##\nabla_{a}C## as a vector. This doesn't make any sense to me, unless they are applying the cost function to one example at a time. But why are they doing that? The cost function is a function of all examples in the output matrix.

It will take some extra effort on my part, but I can show you how I derive ##\nabla_{a}C## as a matrix, simply by applying the chain rule when trying to take the derivative of the cost ##C## with respect to one of the weights in the output layer. ##\nabla_{a}C## is a vector only if you are treating the cost function ##C## as being a function of only one output vector. I could make a separate thread about this, because I believe I have now identified this as being the issue in my program. However, I don't understand why it's an issue, since ##C## is a function of the output matrix, not only an output vector.

Last edited:
As someone already mentioned, what you're trying to do is implement batch training. If you google that you should find the info you need pretty easily.

When you train in batches, you accumulate the gradients from each instance before back propagation.
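In code, that accumulation might look like the following minimal numpy sketch. This is not anyone's actual implementation from the thread: the single linear layer and squared-error cost are assumptions chosen purely to keep the example short.

```python
import numpy as np

def grad_for_example(W, x, y):
    """Gradient of a squared-error cost for one example, for an assumed
    linear 'network' y_hat = W @ x (illustration only)."""
    y_hat = W @ x
    delta = y_hat - y              # output error for this example
    return np.outer(delta, x)      # dC_x/dW, one matrix per example

def batch_update(W, X, Y, eta):
    """Accumulate the per-example gradients over the mini-batch,
    then apply one averaged gradient-descent step."""
    m = X.shape[0]
    grad_sum = np.zeros_like(W)
    for x, y in zip(X, Y):
        grad_sum += grad_for_example(W, x, y)   # accumulate BEFORE updating
    return W - (eta / m) * grad_sum
```

The key point is the order of operations: every example's gradient is computed against the *same* weights, and the weights change only once per batch.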

• pbuk
So, if ##\nabla_{a}C## can indeed be a matrix, then my math is probably not wrong. Also, the deeplearning website claims to be doing batch training, but it still treats ##\nabla_{a}C## as a vector.

Last edited:
This doesn't make any sense to me, unless they are applying the cost function to one example at a time.
That is exactly what 'they' are doing. Quoting from Michael Nielsen's online book near here:
As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, ## C=C_x##. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

But why are they doing that? The cost function is a function of all examples in the output matrix.
Because that is how this algorithm works: it calculates the gradient of the cost function for a single example. We average over all the ## m ## training examples ## x ## (in the mini-batch) in the gradient descent step:

\begin{align} w^l & \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T \\ b^l & \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l} \end{align}
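The sum over examples in the update above can be checked numerically: if the per-example errors ##\delta^{x,l}## are stacked as rows of a matrix and the activations ##a^{x,l-1}## likewise, the sum of per-example outer products collapses into a single matrix product. A small numpy sketch (shapes assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_out, n_in = 4, 3, 5
D = rng.normal(size=(m, n_out))   # row x holds delta^{x,l}
A = rng.normal(size=(m, n_in))    # row x holds a^{x,l-1}

# Sum of per-example outer products, as written in the update rule:
loop_sum = sum(np.outer(D[x], A[x]) for x in range(m))

# The same sum, computed in one shot as a matrix product:
assert np.allclose(loop_sum, D.T @ A)
```

This is why a vectorized implementation can process the whole mini-batch with one matrix multiply while still being, mathematically, a sum of single-example gradients.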

Quoting again from Nielsen:
Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error ##\delta^l## and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.
Again my recommendation to you: study Nielsen's maths, algorithm and his reference Python implementation. Once you have a thorough understanding you will be in a position to judge whether calculating ## \nabla_{x,\delta}C ## instead of ## \frac\eta m \Sigma_{x} \nabla_{\delta} C_x ## is a good idea.

• Because that is how this algorithm works

Well, the algorithm claims to be calculating the gradient below, where ##\omega^{(ℓ)}_{j,k}## is the weight of the ##k^{th}## perceptron in the output layer ##ℓ## that is associated with the output of the ##j^{th}## perceptron in the hidden layer ##ℓ-1##.

$$\nabla^{(ℓ)}_{\Omega}C= \begin{bmatrix} \frac{\partial C }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}} \end{bmatrix}$$

If you simply solve each of those derivatives in the matrix ##\nabla^{(ℓ)}_{\Omega}C##, you will get a matrix for ##\nabla_{a}C##, not a vector or an average of vectors. I've derived how that is done in a separate thread here.

It's not clear to me why this is the wrong approach, or why ##\nabla_{a}C## must be a vector in order for backpropagation to work. Because if we are only interested in the partial derivatives and applying the chain rule to calculate these derivatives, a matrix ##\nabla_{a}C## inevitably results.

So, what is going on here? Why can't we simply calculate the derivatives and apply the chain rule? Why must it be done one example at a time so that ##\nabla_{a}C## can be a vector? I know that's how the algorithm works, but why doesn't it work with a matrix ##\nabla_{a}C##? Why do those explaining the algorithm claim to be interested in calculating the derivatives of and applying the chain rule to the cost function ##C(Y;\hat{Y})##, but end up using some alternative vector implementation of the cost function ##C(y;\hat{y})##, instead?

It seems to me that something is being left out here. We aren't only interested in calculating the derivatives of the cost function with respect to the weights; if that were the case, ##\nabla_{a}C## would be a matrix. Something else is going on here.

Sorry if I'm repeating myself. I'm just trying to get my point across, and it's not exactly the simplest thing to convey in words.

I'm not really understanding Nielsen's approach. I can only understand my approach, which is just to calculate all the elements in ##\nabla^{(ℓ)}_{\Omega}C##, where ##C## is the actual cost function ##C(Y;\hat{Y})##.

Nielsen says
... what backpropagation actually lets us do is compute the partial derivatives ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for a single training example. We then recover ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}## by averaging over training examples.
but why not just compute ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}##, instead of "recovering" them by averaging ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for multiple examples? Nielsen does not seem to explain this.
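As a concrete check of what is being claimed (a one-layer linear network with a mean-squared cost; every shape and name here is an assumption for illustration only), the full-batch gradient with respect to the weight matrix is indeed a matrix, and it coincides with the average of the per-example gradient matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_in, n_out = 6, 4, 3
X = rng.normal(size=(m, n_in))     # all m input examples, one per row
Y = rng.normal(size=(m, n_out))    # all m targets
W = rng.normal(size=(n_in, n_out))

def cost(W):
    """Full-batch cost C(Y, Y_hat) for the assumed linear model."""
    return ((X @ W - Y) ** 2).sum() / (2 * m)

# Average of the m per-example gradient matrices dC_x/dW:
per_example = [np.outer(X[i], X[i] @ W - Y[i]) for i in range(m)]
avg_grad = sum(per_example) / m

# Finite-difference check of one entry of the full-batch gradient:
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
assert abs((cost(Wp) - cost(W)) / eps - avg_grad[0, 0]) < 1e-4
```

So the two routes agree numerically: differentiating the batch cost directly, entry by entry, reproduces the "average over examples" that the algorithm computes.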

Last edited:
If you simply solve each of those derivatives in the matrix ##\nabla^{(ℓ)}_{\Omega}C##, you will get a matrix for ##\nabla_{a}C##, not a vector or an average of vectors. [...] but why not just compute ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}##, instead of "recovering" them by averaging ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for multiple examples?
I think you might be getting confused by focusing on the linear algebra before understanding the goal, the topology, and the function that you're trying to learn.

The point of the neural network is not to learn a function that maps a set of elements to a set of targets. The point is to learn a function that maps an instance to its output, one that generalizes so that the same function works for new, unseen instances. In other words, it's modeling ##f(\mathbf{x})##, not ##f(\mathbf{X})##.

Think about the topology of the network and the function it computes. It doesn't have multiple examples and multiple targets. It's a function that maps a single instance to a single output. You don't have separate edges for different instance outputs. So how could you back propagate the errors of all instances at once (through the same edges) without accumulating them?

The implementation might support applying that function to a bunch of examples in one call, but it's the same function being applied to each example independently.
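That last point can be made concrete with a small numpy sketch (the layer shapes and the tanh activation are assumptions, not from any post above): a batched call gives exactly the same result as applying the single-instance function to each row independently.

```python
import numpy as np

# An assumed 3-input, 2-output layer (illustration only)
W = np.array([[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]])
b = np.array([0.1, -0.2])

def forward_one(x):
    """The network as a function of ONE instance."""
    return np.tanh(x @ W + b)

def forward_batch(X):
    """Same function; each row of X is handled independently."""
    return np.tanh(X @ W + b)

X = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]])
assert np.allclose(forward_batch(X), np.stack([forward_one(x) for x in X]))
```

The batched version is just an efficiency trick: nothing about one row's output depends on any other row.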

Last edited:
• Zap and pbuk
I think you might be getting confused by focusing on the linear algebra before understanding the goal, the topology, and the function that you're trying to learn. [...] So how could you back propagate the errors of all instances at once (through the same edges) without accumulating them?

Hmm ... I think we are really pinpointing why my neural network does not work right now. It was neither a coding error nor a mathematical error; it was more like a misconception of what a neural network is ... That's actually pretty amazing. I'm not entirely understanding what you're saying here. I think I'll need some time to think about it further. Thank you for working this out with me.

I think what you've posted needs to be explicitly stated, because it's not obvious why ##\nabla_{a}C## needs to be a vector if we are treating the algorithm as a math problem. We've worked out here that it's not exactly a straightforward math problem. There's something conceptual here that is not found by simply solving for the gradient of the cost function with respect to the weights of the neural network, and is not obviously implied by simply stating that ##\nabla_{a}C## must be a vector.

Maybe check out this video? See if it helps conceptually?

• Tom.G and pbuk
It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking. For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals. It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.

It's still a bit confusing referring to a cost function as something that is a total average of individual costs.
I guess it's just a naming convention.

Loss function is usually a function defined on a data point, prediction and label, and measures the penalty.
...
Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization).
...
Long story short, I would say that:

A loss function is a part of a cost function which is a type of an objective function.

https://stats.stackexchange.com/que...unction-loss-function-are-they-the-same-thing
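As a sketch of that convention (the function names here are illustrative, not from any library): a loss is defined per data point, and a cost aggregates the losses over the training set, optionally adding a regularization penalty.

```python
import numpy as np

def loss(y_hat, y):
    """Per-example loss: squared error on one prediction/label pair."""
    return 0.5 * np.sum((y_hat - y) ** 2)

def cost(Y_hat, Y, W, lam=0.0):
    """Cost over the whole training set: mean of the per-example losses,
    plus an optional L2 complexity penalty on the weights."""
    m = len(Y)
    data_term = sum(loss(yh, y) for yh, y in zip(Y_hat, Y)) / m
    return data_term + lam * np.sum(W ** 2)
```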

I think what you've posted needs to be explicitly stated, because it's not obvious why ##\nabla_{a}C## needs to be a vector.
What @Jarvis323 posted is obvious to me because what we are trying to do is change each of the coefficients so that the network performs better. For each coefficient we therefore need a single number to adjust it by, not a column of a matrix.

It's still a bit confusing referring to a cost function as something that is a total average of individual costs. That's not exactly what a cost function is, generally speaking.
According to whom?

For example, in multiple linear regression, the cost function is not an average of individual costs, when using the sum of squared residuals.
What do you think the word 'mean' signifies in 'mean squared error'?

It would be better to refer to it explicitly as an average cost, or something along those lines, to avoid misinterpretation.
Look, you've misunderstood something. There's nothing wrong with that, this is a complicated topic and to avoid what Nielsen calls 'index hell' he takes a few short cuts with the notation and terminology. What is wrong is complaining that everyone else is wrong for not explaining it so you understood it first time round. That is not a mature attitude to learning.

Last edited:
• What do you think the word 'mean' signifies in 'mean squared error'?

I was talking about the residual sum of squares, not the mean squared error.

Look, you've misunderstood something. There's nothing wrong with that, this is a complicated topic and to avoid what Nielsen calls 'index hell' he takes a few short cuts with the notation and terminology. What is wrong is complaining that everyone else is wrong for not explaining it so you understood it first time round. That is not a mature attitude to learning.

I'm not sure where you're coming from with this. I don't think I've complained or done anything wrong, here. It was never explained why the gradient had to be a vector until post #41. Not that I require anyone to respond with an explanation, but I will continue asking questions when responded to as much as I like. If I can't ask questions, then there's no point in being on this forum, and the mods should promptly ban me.

A loss or cost function in regression analysis is not the same as the loss/cost function described here; they are not the same, because the approach that I've taken here would work for something like linear regression. I think that someone could easily make that mistake. That's my opinion. It's not me having an attitude. I think it's disrespectful of you, and irrelevant to the thread, to post something like that. It contributes absolutely nothing and attempts to attack my character.

Last edited:
I think it's disrespectful of you and it's irrelevant to the thread to post something like that. It contributes absolutely nothing and attempts to attack my character.
I don't think it should be read this way. It is generally difficult to communicate non-visually. People tend to see only their own perspective and disregard the fact that such a perspective might not necessarily be the one of others. If you feel misunderstood, then try to explain yourself as patiently as others try to help you. We are at post #48, and maybe people are running out of ideas about where the problem can actually be located.

@all: Please try to remain on topic.

The problem was resolved at post #41. I do appreciate everyone who helped out and were not condescending.

The problem was resolved at post #41.
O.k., then let's move forward. Thread closed.