Hi, Peter.
Peter said:
I have to say in passing that learning about ... or further ... getting a good understanding of ... the differentiation of functions/mappings from $\mathbb{R}^n$ to $\mathbb{R}^p$ ... is harder than I thought it would be ... ... :( ... ...
Though it may take some time to see, the case of mappings from $\mathbb{R}^{n}$ to $\mathbb{R}^{p}$ is the natural generalization of mappings from $\mathbb{R}^{n}$ to $\mathbb{R}$ (i.e., real-valued functions of several variables), which is itself the generalization of mappings from $\mathbb{R}$ to $\mathbb{R}$. I will do my best to outline this progression below.
The Derivative as a Linear Operator
The derivative of a function (whether it be $f:\mathbb{R}\rightarrow\mathbb{R}$, $f:\mathbb{R}^{n}\rightarrow\mathbb{R},$ or $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{p}$) at a point $a$ is the linear mapping that most closely resembles $f$ near $a$. (Note: This is why the definition of the derivative (as you've posted previously) involves the term
$$\|f(x)-f(a)-Df(a)(x-a)\|,$$
because this term is the error the linear mapping makes relative to the true function values near $a$, and we want this error to go to zero faster than $\|x-a\|$ as $x\rightarrow a$.)
Whatever the case may be ($f:\mathbb{R}\rightarrow\mathbb{R}$, $f:\mathbb{R}^{n}\rightarrow\mathbb{R},$ or $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{p}$), since the derivative linear operator must approximate $f$, its domain and codomain must have the same dimensions as the domain and codomain of $f$. Hence
$$
\begin{aligned}
f:\mathbb{R}\rightarrow\mathbb{R}\qquad &\Longleftrightarrow\qquad Df(a):\mathbb{R}\rightarrow\mathbb{R}\\
f:\mathbb{R}^{n}\rightarrow\mathbb{R}\qquad &\Longleftrightarrow\qquad Df(a):\mathbb{R}^{n}\rightarrow\mathbb{R}\\
f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{p}\qquad &\Longleftrightarrow\qquad Df(a):\mathbb{R}^{n}\rightarrow\mathbb{R}^{p}
\end{aligned}
$$
Computing Derivatives: Real-Valued Function of a Single Variable
Thinking in terms of matrices and linear operators is not typically emphasized in single-variable calculus because the "matrices" representing the derivative linear operator are all $1\times 1$ (i.e., numbers); see the correspondence table above.
However, let's emphasize the matrix notation via an example. Take $f:\mathbb{R}\rightarrow\mathbb{R}$ to be $f(x)=x^{2}.$ The domain and range spaces are collections of $1\times 1$ column vectors. The derivative of $f$ at a point $a\in\mathbb{R}$ will be a linear mapping from $\mathbb{R}$ to $\mathbb{R}$, and so can be expressed as a $1\times 1$ matrix:
$$Df(a)=[2a].$$
This matrix acts on a $1\times 1$ column vector $v\in\mathbb{R}$ by matrix multiplication to produce a column vector in the range copy of $\mathbb{R}$:
$$Df(a)v=[2a][v_{1}]=[2av_{1}].$$
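If it helps, here is a minimal Python sketch that checks this example numerically. The helper name `linear_error_ratio` and the sample point $a=3$ are my own choices for illustration. It evaluates the error term from the definition of the derivative, divided by $|x-a|$, using $Df(a)=[2a]$, and shows that the ratio shrinks as $x\rightarrow a$.

```python
def f(x):
    return x ** 2


def linear_error_ratio(a, h):
    """Return |f(a+h) - f(a) - Df(a)*h| / |h|, using Df(a) = 2a for f(x) = x^2."""
    derivative = 2 * a  # the single entry of the 1x1 matrix Df(a)
    return abs(f(a + h) - f(a) - derivative * h) / abs(h)


a = 3.0  # sample point, chosen arbitrarily
ratios = [linear_error_ratio(a, 10.0 ** -k) for k in range(1, 6)]
for k, ratio in zip(range(1, 6), ratios):
    print(f"h = 1e-{k}: error / |h| = {ratio:.2e}")
```

For this particular $f$ the error is exactly $h^{2}$, so the printed ratio should be roughly $|h|$ itself (up to floating-point rounding), which is why it tends to zero.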
Computing Derivatives: Real-Valued Function of Several Variables
Now we are thinking of functions $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$. From the table above, we know that the derivative will be a linear mapping from $\mathbb{R}^{n}$ to $\mathbb{R}$, which means that it can be represented by a $1\times n$ matrix.
For example, take $f:\mathbb{R}^{3}\rightarrow\mathbb{R}$ to be defined as $f(x)=f(x_{1},x_{2},x_{3})=x_{1}^{2}+2x_{2}x_{3}+x_{3}^{2}$ and note $v\in\mathbb{R}^{3}$ is the column vector
$$
v=
\begin{bmatrix}
v_{1}\\
v_{2}\\
v_{3}
\end{bmatrix}.
$$
This next step is critical: The components of the $1\times 3$ derivative matrix are obtained by doing single-variable calculus. In other words, the first component is obtained by doing single-variable calculus on $f$ with respect to $x_{1}$ (i.e., differentiating with respect to $x_{1}$ only and thinking of $x_{2}$ and $x_{3}$ as constants), the second component is obtained by doing single-variable calculus on $f$ with respect to $x_{2}$ (i.e., differentiating with respect to $x_{2}$ only and thinking of $x_{1}$ and $x_{3}$ as constants), and the third component is obtained by doing single-variable calculus on $f$ with respect to $x_{3}$ (i.e., differentiating with respect to $x_{3}$ only and thinking of $x_{1}$ and $x_{2}$ as constants). Hence,
$$Df(x) =
\begin{bmatrix}
2x_{1}& 2x_{3}& 2x_{2}+2x_{3}
\end{bmatrix},
$$
or, evaluating at the point $x=a$,
$$Df(a) =
\begin{bmatrix}
2a_{1}& 2a_{3}& 2a_{2}+2a_{3}
\end{bmatrix}.
$$
Now, this derivative operator acts on $v$ through matrix multiplication to produce a $1\times 1$ column vector in $\mathbb{R}$:
$$Df(a)v=
\begin{bmatrix}
2a_{1}& 2a_{3}& 2a_{2}+2a_{3}
\end{bmatrix}
\begin{bmatrix}
v_{1}\\
v_{2}\\
v_{3}
\end{bmatrix}
=[2a_{1}v_{1}+2a_{3}v_{2}+(2a_{2}+2a_{3})v_{3}].
$$
Again, the key here is that each column of the above $1\times n$ derivative matrix is obtained by doing single-variable calculus.
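The computation above can be sanity-checked with a short Python sketch. The function names (`derivative_row`, `numeric_row`, `apply_row`), the step size `h`, and the sample values of $a$ and $v$ are my own assumptions for illustration. Each entry of the row is approximated by a one-variable central difference, holding the other two variables fixed, and compared with the formula $[2a_{1},\ 2a_{3},\ 2a_{2}+2a_{3}]$.

```python
def f(x1, x2, x3):
    return x1 ** 2 + 2 * x2 * x3 + x3 ** 2


def derivative_row(a):
    """The 1x3 matrix Df(a) from the hand computation."""
    a1, a2, a3 = a
    return [2 * a1, 2 * a3, 2 * a2 + 2 * a3]


def numeric_row(a, h=1e-6):
    """Approximate each entry with a single-variable central difference."""
    row = []
    for j in range(3):
        up = list(a)
        down = list(a)
        up[j] += h    # vary only the j-th variable,
        down[j] -= h  # keeping the other two constant
        row.append((f(*up) - f(*down)) / (2 * h))
    return row


def apply_row(row, v):
    """Row vector times column vector; the 1x1 result is returned as a number."""
    return sum(r * vj for r, vj in zip(row, v))


a = (1.0, 2.0, 3.0)   # sample point, chosen arbitrarily
v = (1.0, 0.0, -1.0)  # sample vector, chosen arbitrarily
exact = derivative_row(a)
approx = numeric_row(a)
print("formula:       ", exact)
print("finite diff:   ", approx)
print("Df(a) v:       ", apply_row(exact, v))
```

The two rows should agree to within rounding error, and `apply_row` is just the matrix multiplication $Df(a)v$ written out as a sum.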
Computing Derivatives: Vector-Valued Functions of Several Variables
Now we are considering $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{p}.$ From our table above, we know that we must eventually construct a linear mapping from $\mathbb{R}^{n}$ to $\mathbb{R}^{p}$; i.e., a matrix of dimension $p\times n$.
This case is actually no different from the one before. All that is happening is that we now have $p$ different functions of $n$ variables that we collect in a column vector of height $p$:
$$f(x)=f(x_{1},x_{2},\ldots, x_{n})=
\begin{bmatrix}
f_{1}(x)\\
f_{2}(x)\\
\vdots\\
f_{p}(x)
\end{bmatrix}
=
\begin{bmatrix}
f_{1}(x_{1},x_{2},\ldots, x_{n})\\
f_{2}(x_{1},x_{2},\ldots, x_{n})\\
\vdots\\
f_{p}(x_{1},x_{2},\ldots, x_{n})
\end{bmatrix}.
$$
To simplify things in your mind, really try to see that all that we have here is $p$ different versions of the previous case (i.e., real-valued functions of several variables). For all intents and purposes, each function $f_{i}(x)$ in the above column vector has nothing to do with any of the other functions in the column vector.
To differentiate such an object, we go one entry at a time in the column vector, using the method of the previous section to differentiate each real-valued function of several variables. According to the previous section, the derivative of a real-valued function of several variables is a row vector obtained by doing single-variable calculus with respect to each of the variables. Symbolically,
$$Df_{i}(a)=
\begin{bmatrix}
\partial_{x_{1}}f_{i}(a) & \partial_{x_{2}}f_{i}(a) & \cdots & \partial_{x_{n}}f_{i}(a)
\end{bmatrix}.
$$
All that we do to form the $p\times n$ derivative matrix $Df(a)$ is assemble each of these individually obtained row vectors into a single matrix:
$$
Df(a)=
\begin{bmatrix}
\partial_{x_{1}}f_{1}(a) & \partial_{x_{2}}f_{1}(a) & \cdots & \partial_{x_{n}}f_{1}(a)\\
\partial_{x_{1}}f_{2}(a) & \partial_{x_{2}}f_{2}(a) & \cdots & \partial_{x_{n}}f_{2}(a)\\
\vdots & \vdots & \ddots & \vdots\\
\partial_{x_{1}}f_{p}(a) & \partial_{x_{2}}f_{p}(a) & \cdots & \partial_{x_{n}}f_{p}(a)
\end{bmatrix}.
$$
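To make this concrete, here is a small made-up example (my own choice of function, not one from your text). Take $f:\mathbb{R}^{2}\rightarrow\mathbb{R}^{3}$ with components $f_{1}(x)=x_{1}^{2}x_{2}$, $f_{2}(x)=x_{1}+3x_{2}$, and $f_{3}(x)=x_{1}x_{2}^{2}$. Differentiating each component as a real-valued function of two variables and stacking the resulting rows gives
$$
Df(x)=
\begin{bmatrix}
2x_{1}x_{2} & x_{1}^{2}\\
1 & 3\\
x_{2}^{2} & 2x_{1}x_{2}
\end{bmatrix},
$$
which is a $3\times 2$ matrix, as expected for $p=3$ and $n=2$. At the point $a=(1,2)$ this becomes
$$
Df(a)=
\begin{bmatrix}
4 & 1\\
1 & 3\\
4 & 4
\end{bmatrix}.
$$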
Here are two (hopefully) helpful ways to think intuitively about differentiating a function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{p}$:
1) You can think of "pushing" the derivative symbol $D$ onto the components of $f$:
$$D_{p\times n}f(a)=
\begin{bmatrix}
D_{1\times n}f_{1}(a)\\
D_{1\times n}f_{2}(a)\\
\vdots\\
D_{1\times n}f_{p}(a)
\end{bmatrix},
$$
where $D_{p\times n}$ is the derivative we want to compute and $D_{1\times n}$ is the derivative we know how to compute from the case of a real-valued function of several variables from the second "Computing Derivatives" section above.
2) Since you know you will need to compute a $p\times n$ matrix, label the $n$ columns across the top of your matrix with $x_{1}, x_{2},\ldots, x_{n}$ (or possibly $\partial _{x_{1}}$, $\partial_{x_{2}}$, $\ldots,$ $\partial_{x_{n}}$), and the $p$ rows with $f_{1}$, $f_{2}$, $\ldots,$ $f_{p}$. By selecting a particular row you reduce the problem to considering a real-valued function of several variables. By selecting a column in this row, you reduce the problem to a real-valued function of a single variable. For example, if you select row $2$ and column $3$, you are now only dealing with the real-valued function $f_{2}(x_{1},x_{2},x_{3},\ldots, x_{n})$ as a function of the single variable $x_{3}$ ($x_{1}, x_{2}, x_{4}, \ldots, x_{n}$ are thought of as constants). To obtain the $p\times n$ derivative matrix, compute the partial derivative indicated by the row and column of each entry.
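The row-and-column bookkeeping in 2) can be sketched in Python as follows. This is only an illustration under my own assumptions: the sample map $f(x_{1},x_{2})=(x_{1}^{2}x_{2},\ x_{1}+3x_{2},\ x_{1}x_{2}^{2})$ is made up, the helper name `jacobian` and the step size `h` are hypothetical, and the partial derivatives are approximated by central differences rather than computed symbolically.

```python
def f(x):
    """Hypothetical sample map from R^2 to R^3 (not from the post)."""
    x1, x2 = x
    return [x1 ** 2 * x2, x1 + 3 * x2, x1 * x2 ** 2]


def jacobian(func, a, h=1e-6):
    """Build the p x n derivative matrix one entry at a time.

    Entry (i, j) is a central-difference estimate of the partial
    derivative of the component f_i with respect to x_j at the point a.
    """
    n = len(a)
    p = len(func(a))
    matrix = [[0.0] * n for _ in range(p)]
    for j in range(n):      # pick a column: the variable x_j
        up = list(a)
        down = list(a)
        up[j] += h          # only x_j changes; the other
        down[j] -= h        # variables are held constant
        f_up, f_down = func(up), func(down)
        for i in range(p):  # pick a row: the component f_i
            matrix[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return matrix


a = [1.0, 2.0]  # sample point, chosen arbitrarily
J = jacobian(f, a)
# Partial derivatives computed by hand at a = (1, 2), for comparison
expected = [[4.0, 1.0], [1.0, 3.0], [4.0, 4.0]]
for row in J:
    print([round(entry, 6) for entry in row])
```

Notice that the inner loop never mixes components: row $i$ only ever looks at $f_{i}$, which is the sense in which each component can be treated on its own.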