
Summary: Deriving the gradient of the cost function ##\nabla_{a}C##.

This is an issue I've been stuck on for about two weeks. No matter how many times I take this derivative, I keep getting the same answer. However, this answer is inevitably wrong. Please help me understand why it is incorrect.

To start, I will define an input matrix ##X##, where ##n## is the number of features or independent variables in the neural network model, and ##m## is the number of examples.

$$

X = \begin{bmatrix}

x_{1,1} & x_{1,2} & \dots & x_{1,n} \\

x_{2,1} & x_{2,2} & \dots & x_{2,n} \\

\vdots & \vdots & \ddots & \vdots \\

x_{m,1} & x_{m,2} & \dots & x_{m,n}

\end{bmatrix}

$$

Next, I will define an output matrix, which is the output of a feedforward neural network after forward propagation of the input matrix ##X##. In other words, this is the model's output ##\hat{Y}##, where ##m## is the number of examples in the input matrix, and ##k## is the number of perceptrons in the output layer.

$$

\hat{Y} = \begin{bmatrix}

\hat{y}_{1,1} & \hat{y}_{1,2} & \dots & \hat{y}_{1,k} \\

\hat{y}_{2,1} & \hat{y}_{2,2} & \dots & \hat{y}_{2,k} \\

\vdots & \vdots & \ddots & \vdots \\

\hat{y}_{m,1} & \hat{y}_{m,2} & \dots & \hat{y}_{m,k} \\

\end{bmatrix}

$$

The output matrix can also be written as a matrix of the activations of the perceptrons in the output layer: each element ##\hat{y}_{m,k}## of the output matrix equals ##a^{(ℓ)}_{m,k}##, the activation value of the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example, as seen below.

$$

\hat{Y} = \begin{bmatrix}

a^{(ℓ)}_{1,1} & a^{(ℓ)}_{1,2} & \dots & a^{(ℓ)}_{1,k} \\

a^{(ℓ)}_{2,1} & a^{(ℓ)}_{2,2} & \dots & a^{(ℓ)}_{2,k} \\

\vdots & \vdots & \ddots & \vdots \\

a^{(ℓ)}_{m,1} & a^{(ℓ)}_{m,2} & \dots & a^{(ℓ)}_{m,k}

\end{bmatrix}

$$

There is also a target matrix ##Y## of the same form as ##\hat{Y}##, where ##m## is the number of examples, and ##k## is the number of dependent variables in the neural network model.

$$

Y = \begin{bmatrix}

y_{1,1} & y_{1,2} & \dots & y_{1,k} \\

y_{2,1} & y_{2,2} & \dots & y_{2,k} \\

\vdots & \vdots & \ddots & \vdots \\

y_{m,1} & y_{m,2} & \dots & y_{m,k} \\

\end{bmatrix}

$$

I will now define a cost function ##C(Y;\hat{Y})##, which uses the elements in the target matrix ##Y## as constant parameters and is a function of the output matrix ##\hat{Y}##.

$$C(Y;\hat{Y})$$

Let's now put our focus on the cost function's dependent variable, the output matrix ##\hat{Y}##, and make it more explicit that the cost function is a function of all the elements in the output matrix ##\hat{Y}##, as shown below.

$$C(\hat{y}_{1,1},\dots,\hat{y}_{1,k},\dots,\hat{y}_{m,1},\dots,\hat{y}_{m,k})$$

$$C(a^{(ℓ)}_{1,1},\dots,a^{(ℓ)}_{1,k},\dots,a^{(ℓ)}_{m,1},\dots,a^{(ℓ)}_{m,k})$$

Let's now define ##a^{(ℓ)}_{m,k}##. This is nothing but the value of the activation function ##f## given a weighted input ##z^{(ℓ)}_{m,k}## at the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example.

$$a^{(ℓ)}_{m,k}=f(z^{(ℓ)}_{m,k})$$

Let's now define the weighted input ##z^{(ℓ)}_{m,k}##, where ##a^{(ℓ-1)}_{m,i}## is the activation value of the ##i^{th}## perceptron in the hidden layer ##ℓ-1## for the ##m^{th}## example, ##j## is the number of perceptrons in the hidden layer ##ℓ-1##, and ##\omega^{(ℓ)}_{i,k}## is the ##i^{th}## weight for the ##k^{th}## perceptron in the output layer ##ℓ## (the sum starts at ##i=0##, presumably with ##a^{(ℓ-1)}_{m,0}=1## so that the ##i=0## term acts as the bias).

$$z^{(ℓ)}_{m,k}=\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}$$
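As a sanity check on the notation, the forward pass above can be sketched in NumPy. The sizes, the sigmoid choice of ##f##, and the bias-folding convention (##a_{m,0}=1##) are all assumptions for illustration; the derivation holds for any differentiable ##f##.

```python
import numpy as np

def sigmoid(z):
    # An assumed choice of activation f; any differentiable f works here.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: m examples, j hidden perceptrons, k output perceptrons.
m, j, k = 4, 3, 2
rng = np.random.default_rng(0)

A_prev = rng.normal(size=(m, j + 1))   # a^{(l-1)}_{m,i}, i = 0..j
A_prev[:, 0] = 1.0                     # a_{m,0} = 1 folds the bias into the i = 0 weight
W = rng.normal(size=(j + 1, k))        # omega^{(l)}_{i,k}

Z = A_prev @ W                         # z_{m,k} = sum_i a^{(l-1)}_{m,i} * omega^{(l)}_{i,k}
A = sigmoid(Z)                         # a^{(l)}_{m,k} = f(z_{m,k}), i.e. Y_hat

print(A.shape)  # (4, 2): one output row per example, one column per output perceptron
```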

So, we want to find the value of the derivative of the cost function with respect to a weight ##\omega^{(ℓ)}_{j,k}##, which is the ##j^{th}## weight of the ##k^{th}## perceptron in the output layer ##ℓ##, denoted below.

$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=?$$

So, the first thing we can do is treat every activation whose perceptron index is not ##k## as a constant, since ##\omega^{(ℓ)}_{j,k}## only enters the ##k^{th}## perceptron in the output layer ##ℓ##. The cost then effectively depends on the ##k^{th}## activation of every example:

$$C(a^{(ℓ)}_{1,k},\dots,a^{(ℓ)}_{m,k})$$

Next, we can note that ##a^{(ℓ)}_{m,k}=f(z^{(ℓ)}_{m,k})## and ##z^{(ℓ)}_{m,k}=\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}##, so that

$$a^{(ℓ)}_{m,k}=f\left(\sum_{i=0}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}\right)$$

and we can write each ##a^{(ℓ)}_{h,k}## as a function of ##\omega^{(ℓ)}_{j,k}##, i.e. ##a^{(ℓ)}_{h,k}(\omega^{(ℓ)}_{j,k})##. Therefore

$$C(a^{(ℓ)}_{1,k}(\omega^{(ℓ)}_{j,k}),\dots,a^{(ℓ)}_{m,k}(\omega^{(ℓ)}_{j,k}))$$

and we can invoke the generalized chain rule.

$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial a^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}} $$

Now, applying the chain rule once more, we can separate this value into two parts.

$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial z^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} $$

$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = \frac{\partial a^{(ℓ)}_{h,k}}{\partial z^{(ℓ)}_{h,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$

These simplify to the expressions below, using ##\partial z^{(ℓ)}_{h,k}/\partial \omega^{(ℓ)}_{j,k}=a^{(ℓ-1)}_{h,j}## and ##\partial a^{(ℓ)}_{h,k}/\partial z^{(ℓ)}_{h,k}=f'(z^{(ℓ)}_{h,k})##.

$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} $$

$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = f^{'}(z^{(ℓ)}_{h,k}) \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$
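Results ##(1)## and ##(2)## can be verified numerically against a finite-difference derivative. Everything concrete here is an assumption for the check: a sigmoid activation and a squared-error cost ##C=\sum_{h,k}(a_{h,k}-y_{h,k})^2##, so that ##\partial C/\partial a = 2(a-y)## and ##f'(z)=a(1-a)##.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical small layer: m = 3 examples, j+1 = 3 inputs, k = 2 outputs.
rng = np.random.default_rng(1)
A_prev = rng.normal(size=(3, 3))
W = rng.normal(size=(3, 2))
Y = rng.normal(size=(3, 2))

def cost(W):
    # Assumed cost: C = sum_{h,k} (a_{h,k} - y_{h,k})^2, so dC/da = 2 (a - y).
    A = sigmoid(A_prev @ W)
    return np.sum((A - Y) ** 2)

# Result (2): dC/dz_{h,k} = f'(z_{h,k}) * dC/da_{h,k}, with f'(z) = a (1 - a).
Z = A_prev @ W
A = sigmoid(Z)
dC_dz = (A * (1 - A)) * (2 * (A - Y))

# Result (1): dC/dw_{j,k} = sum_h a^{(l-1)}_{h,j} * dC/dz_{h,k}.
analytic = sum(np.outer(A_prev[h], dC_dz[h]) for h in range(3))

# Finite-difference check of one entry, dC/dw_{0,1}.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
numeric = (cost(W_pert) - cost(W)) / eps

print(abs(numeric - analytic[0, 1]))  # small: the two estimates agree
```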

Now, let us apply this result to all of the weights in the output layer ##ℓ## for all ##k## perceptrons, defined as ##\Omega^{(ℓ)}## below, and try to find the gradient ##\nabla^{(ℓ)}_{\Omega}C##.

$$

\Omega^{(ℓ)}=

\begin{bmatrix}

\omega_{1,1} & \dots & \omega_{1,k} \\

\vdots & \ddots & \vdots \\

\omega_{j,1} & \dots & \omega_{j,k}

\end{bmatrix}

$$

$$

\nabla^{(ℓ)}_{\Omega}C=

\begin{bmatrix}

\frac{\partial C}{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{1,k}} \\

\vdots & \ddots & \vdots \\

\frac{\partial C}{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}

\end{bmatrix}

$$

First, we simply plug result ##(1)## into this matrix.

$$

\nabla^{(ℓ)}_{\Omega}C=

\begin{bmatrix}

\sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} \\

\vdots & \ddots & \vdots \\

\sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}}

\end{bmatrix}

$$

This is nothing but the matrix multiplication operation

$$

\nabla^{(ℓ)}_{\Omega}C=

\begin{bmatrix}

a^{(ℓ-1)}_{1,1} & \dots & a^{(ℓ-1)}_{m,1} \\

\vdots & \ddots & \vdots \\

a^{(ℓ-1)}_{1,j} & \dots & a^{(ℓ-1)}_{m,j} \\

\end{bmatrix}

\begin{bmatrix}

\frac{\partial C}{\partial z^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{1,k}} \\

\vdots & \ddots & \vdots \\

\frac{\partial C}{\partial z^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{m,k}} \\

\end{bmatrix}

$$

This product is written compactly below, where ##L^{(ℓ-1)}## is the matrix of inputs to the output layer ##ℓ## (i.e. the activations of layer ##ℓ-1##) and ##\nabla^{(ℓ)}_{z}C## is the gradient of the cost function with respect to the weighted input to the output layer.

$$

\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C

$$
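The identity between the per-entry sums and the matrix product ##(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C## can be checked directly with arbitrary values (the sizes and the random stand-in matrices below are assumptions for the check only).

```python
import numpy as np

# Hypothetical sizes: m = 5 examples, j = 3 inputs to the layer, k = 2 outputs.
rng = np.random.default_rng(2)
m, j, k = 5, 3, 2
L_prev = rng.normal(size=(m, j))     # L^{(l-1)}: activations feeding the output layer
dC_dZ = rng.normal(size=(m, k))      # nabla_z C, one row per example

# Entry (p, q) of the gradient matrix: sum_h a^{(l-1)}_{h,p} * dC/dz_{h,q}.
elementwise = np.zeros((j, k))
for p in range(j):
    for q in range(k):
        elementwise[p, q] = sum(L_prev[h, p] * dC_dZ[h, q] for h in range(m))

# Matrix form: (L^{(l-1)})^T nabla_z C.
matrix_form = L_prev.T @ dC_dZ

print(np.allclose(elementwise, matrix_form))  # True
```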

Now, let's focus on ##\nabla^{(ℓ)}_{z}C##, which we can expand by plugging in result ##(2)##.

$$

\nabla^{(ℓ)}_{z}C= \begin{bmatrix}

f^{'}(z^{(ℓ)}_{1,1}) \frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & f^{'}(z^{(ℓ)}_{1,k}) \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\

\vdots & \ddots & \vdots \\

f^{'}(z^{(ℓ)}_{m,1}) \frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & f^{'}(z^{(ℓ)}_{m,k}) \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}

\end{bmatrix}

$$

We can split this into two matrices using the Hadamard product.

$$

\nabla^{(ℓ)}_{z}C= \begin{bmatrix}

f^{'}(z^{(ℓ)}_{1,1}) & \dots & f^{'}(z^{(ℓ)}_{1,k}) \\

\vdots & \ddots & \vdots \\

f^{'}(z^{(ℓ)}_{m,1}) & \dots & f^{'}(z^{(ℓ)}_{m,k})

\end{bmatrix}

\odot

\begin{bmatrix}

\frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\

\vdots & \ddots & \vdots \\

\frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}

\end{bmatrix}

$$
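In NumPy the Hadamard product is just elementwise `*`, so this split is a one-liner. The sigmoid (for which ##f'(z)=a(1-a)##) and the random stand-in values for ##\nabla^{(ℓ)}_{a}C## are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical m = 4 examples, k = 2 output perceptrons.
rng = np.random.default_rng(3)
Z = rng.normal(size=(4, 2))          # weighted inputs z^{(l)}_{m,k}
dC_dA = rng.normal(size=(4, 2))      # nabla_a C (stand-in values)

A = sigmoid(Z)
f_prime = A * (1 - A)                # f'(z) for the assumed sigmoid activation

# Hadamard product: elementwise f'(z_{m,k}) * dC/da_{m,k}.
dC_dZ = f_prime * dC_dA

print(dC_dZ.shape)  # (4, 2): same m x k shape as both factors
```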

And, lo and behold, the gradient ##\nabla^{(ℓ)}_{a}C## is a matrix, as shown.

$$

\nabla^{(ℓ)}_{a}C=

\begin{bmatrix}

\frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\

\vdots & \ddots & \vdots \\

\frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}

\end{bmatrix}

$$

So, why is this incorrect? In the backpropagation algorithm, ##\nabla^{(ℓ)}_{a}C## is always treated either as a vector or as some kind of average of vectors. Why is this? Why isn't ##\nabla^{(ℓ)}_{a}C## a matrix? Where did I go wrong in the math?
