Well, the algorithm claims to be calculating the gradient below, where ##\omega^{(ℓ)}_{j,k}## is the weight of the ##k^{th}## perceptron in the output layer ##ℓ## that is associated with the output of the ##j^{th}## perceptron in the hidden layer ##ℓ-1##.
$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\frac{\partial C }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}
\end{bmatrix}
$$
If you simply evaluate each of those derivatives in the matrix ##\nabla^{(ℓ)}_{\Omega}C##, you will get a matrix for ##\nabla_{a}C##, not a vector or an average of vectors. I've derived how that is done in a separate thread here.
It's not clear to me why this is the wrong approach, or why ##\nabla_{a}C## must be a vector in order for backpropagation to work. If we are only interested in the partial derivatives, and in applying the chain rule to calculate them, a matrix ##\nabla_{a}C## inevitably results.
So, what is going on here? Why can't we simply calculate the derivatives and apply the chain rule? Why must it be done one example at a time so that ##\nabla_{a}C## can be a vector? I know that's how the algorithm works, but why doesn't it work with a matrix ##\nabla_{a}C##? Why do those explaining the algorithm claim to be differentiating the cost function ##C(Y;\hat{Y})## with the chain rule, but end up using a per-example vector version of the cost function, ##C(y;\hat{y})##, instead?
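For reference, here is a sketch of how the two cost functions would be related, assuming the full cost is defined as the average of the per-example costs over ##n## training examples (the form Nielsen assumes for the quadratic cost). The example index ##x## and the count ##n## are notation introduced here for illustration.
$$
C(Y;\hat{Y}) = \frac{1}{n}\sum_{x} C_{x}(y_{x};\hat{y}_{x})
\quad\Longrightarrow\quad
\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}} = \frac{1}{n}\sum_{x} \frac{\partial C_{x}}{\partial \omega^{(ℓ)}_{j,k}}
$$
The implication only uses the linearity of differentiation, so it holds under that assumption and says nothing about cost functions that cannot be written as such an average.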
It seems to me that something is being left out here. We aren't only interested in calculating the derivatives of the cost function with respect to the weights. If that were the case, ##\nabla_{a}C## would be a matrix. Something else is going on here.
Sorry if I'm repeating myself. I'm just trying to get my point across, and it's not exactly the simplest thing to convey in words.
I'm not really understanding Nielsen's approach. I can only understand my approach, which is just to calculate all the elements in ##\nabla^{(ℓ)}_{\Omega}C##, where ##C## is the actual cost function ##C(Y;\hat{Y})##.
Nielsen says
but why not just compute ##\frac{\partial C}{\partial \omega}## and ##\frac{\partial C}{\partial b}##, instead of "recovering" them by averaging ##\frac{\partial C_{x}}{\partial \omega}## and ##\frac{\partial C_{x}}{\partial b}## for multiple examples? Nielsen does not seem to explain this.
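To make the comparison concrete, here is a minimal NumPy sketch of the two computations. It rests on assumptions chosen purely for illustration, not taken from Nielsen's code: a single linear layer ##a = xW + b## with no activation function, and the quadratic cost averaged over the batch. The function names `batch_gradient` and `averaged_per_example_gradient` are hypothetical. The first function differentiates ##C(Y;\hat{Y})## directly, so its ##\nabla_{a}C## is a matrix with one row per example; the second works one example at a time, so its ##\nabla_{a}C_{x}## is a vector, and the results are averaged.

```python
import numpy as np

# Toy sizes (assumed): n examples, j inputs, k outputs.
rng = np.random.default_rng(0)
n, j, k = 5, 4, 3
X = rng.normal(size=(n, j))   # one input example per row
Y = rng.normal(size=(n, k))   # one target per row
W = rng.normal(size=(j, k))   # W[j, k]: weight from input j to output k
b = rng.normal(size=k)


def batch_gradient(X, Y, W, b):
    """Differentiate C = 1/(2n) * sum_x ||a_x - y_x||^2 for the whole batch."""
    A = X @ W + b                    # (n, k) activations for all examples
    grad_A = (A - Y) / len(X)        # (n, k): nabla_a C is a matrix here
    grad_W = X.T @ grad_A            # (j, k): chain rule in matrix form
    grad_b = grad_A.sum(axis=0)      # (k,)
    return grad_W, grad_b


def averaged_per_example_gradient(X, Y, W, b):
    """Differentiate C_x = 1/2 * ||a_x - y_x||^2 per example, then average."""
    grad_W = np.zeros_like(W)
    grad_b = np.zeros_like(b)
    for x, y in zip(X, Y):
        a = x @ W + b                # (k,) activation for one example
        grad_a = a - y               # (k,): nabla_a C_x is a vector here
        grad_W += np.outer(x, grad_a)
        grad_b += grad_a
    return grad_W / len(X), grad_b / len(X)


gW_batch, gb_batch = batch_gradient(X, Y, W, b)
gW_avg, gb_avg = averaged_per_example_gradient(X, Y, W, b)
print(np.allclose(gW_batch, gW_avg), np.allclose(gb_batch, gb_avg))
```

In this toy setting the two routes can be compared entry by entry. Whether the same comparison carries over to a multi-layer network with activation functions is not something this sketch establishes.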