Please convince me (OLS matrix derivation)

  • Context: Graduate
  • Thread starter: longrob
  • Tags: Derivation, Matrix
SUMMARY

The discussion centers on the derivation of the normal equations for Ordinary Least Squares (OLS) estimates, specifically the differentiation of the expression B'X'XB with respect to B. The participants confirm that the derivative is indeed 2X'XB, based on the properties of quadratic forms and the chain rule in matrix calculus. The conversation also touches on the geometric interpretation of OLS and the importance of understanding matrix differentiation rules.

PREREQUISITES
  • Understanding of Ordinary Least Squares (OLS) regression
  • Familiarity with matrix calculus and differentiation
  • Knowledge of quadratic forms in linear algebra
  • Basic concepts of linear projections in vector spaces
NEXT STEPS
  • Study the derivation of the normal equations in OLS regression
  • Learn about matrix differentiation rules, particularly for quadratic forms
  • Explore the geometric interpretation of OLS using orthogonal projections
  • Investigate the application of the chain rule in matrix calculus
USEFUL FOR

Statisticians, data scientists, and anyone involved in regression analysis or linear algebra who seeks to deepen their understanding of OLS and matrix differentiation techniques.

longrob
Hi all

In the derivation of the normal equations for Ordinary Least Squares estimates we have B (m x 1 column vector) and X (n x m matrix). Could someone please convince me that the derivative with respect to B of
B'X'XB
is
2X'XB

Thanks !
LR
 
I'm sorry, I'm not familiar with the OLS you're referring to. The one I'm familiar with is used either to find the best approximation (using orthogonal projection) to an inconsistent system, or to find the line of best fit to a set of data points, i.e., the least-squares line: the line that minimizes the sum of squared residuals. I don't recognize your layout, nor the meaning of B' in the derivative. Would you expand on your layout?
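For concreteness, this is the least-squares line I mean, as a small numerical sketch (Python with numpy; the data are made up purely for illustration):

import numpy as np

# Made-up data roughly following y = 2 + 0.5*x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Design matrix: a column of ones (intercept) and the x values
X = np.column_stack([np.ones_like(x), x])

# Coefficients of the line minimizing the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [2.0, 0.5]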
 
Hi and thanks for your reply.
Could you take a look here:
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
On page 18/19 you see exactly what (I think) you are referring to in terms of the orthogonal projection. What I am referring to is on the bottom of page 19:
"Differentiating with respect to beta and setting to zero, we find that beta-hat satisfies"
It's this differentiation I am specifically talking about (of the 2nd term on the RHS).

Thanks again
LR
 
Excuse me:
I meant "the 3rd term in the expression above" not "the 2nd term on the RHS".
 
Sorry, longrob, I have not had a chance to look at it, and I'm going to be busy for a while.

Still, from what I saw, it seems like the chain rule should be involved, since d/dB[B'X'XB] = d/dB[(XB)'(XB)], and this is a product together with a chain rule; we have g(B) = (XB)' and f(B) = XB (where X is a constant with respect to B), so that g(B) is a composition of scaling and transposition. I'll look into it if I have a chance.
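For what it's worth, here is a sketch of that product/chain-rule idea using differentials, writing u = XB so that the scalar is u'u and du = X dB (X treated as constant):

$$d(\mathbf{B}'\mathbf{X}'\mathbf{X}\mathbf{B}) = d\big((\mathbf{XB})'(\mathbf{XB})\big) = (\mathbf{X}\,d\mathbf{B})'(\mathbf{XB}) + (\mathbf{XB})'(\mathbf{X}\,d\mathbf{B}) = 2\,\mathbf{B}'\mathbf{X}'\mathbf{X}\,d\mathbf{B}$$

since each of the two middle terms is a scalar and hence equal to its own transpose. Reading off the coefficient of dB (as a column vector) gives the gradient 2X'XB.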
 
I'm sorry, I have been a bit confused about this myself; the best I have been able to come up with has been this; maybe/hopefully you or someone else can finish it and/or make it more precise:

In B'X'XB, X'X is a square m x m matrix, and B' is just B transposed, so what you get is a quadratic form B'MB with M = X'X. Another approach is to differentiate d/dB(B'MB) formally as a product, which gives something like M'B + MB, and since M = X'X is symmetric this comes out as 2MB = 2X'XB, though I can't fully justify that formal step.

Sorry, I have not been able to figure it out better.
 
Hi Bacle, thanks for your messages. I'm glad I'm not the only one who is a bit confused by it.

For completeness and the benefit of others, I'll explain the setup so that it's not necessary to refer to the link I posted.

We have
y = XB + e

where y is an n x 1 column vector of responses,
X is an n x (k+1) matrix with a leading column of 1s for the regression intercept, the remaining columns being the n x 1 vectors of observed values of the explanatory variables,
B is a (k+1) x 1 column vector of regression coefficients, and
e is an n x 1 column vector of errors.

We wish to find the values of the elements of B which minimise the sum of squared errors.
That is, we wish to minimise e'e
Now,
e'e = (y - XB)' (y - XB)

= (y'- B'X') (y - XB)

= y'y - y'XB - B'X'y + B'X'XB

Now, since y'XB and B'X'y are 1 x 1 , they are equal, so:

= y'y - 2y'XB + B'X'XB

Differentiating this with respect to B, term by term (y'y gives 0, -2y'XB gives -2X'y, and B'X'XB gives 2X'XB), and setting to zero yields

X'XB = X'y

which are the so-called "normal equations", which upon rearrangement give the least squares estimator for B (provided that X'X is invertible).

It is the last step above which I don't get - that the derivative of B'X'XB with respect to B is 2X'XB. Obviously B'X'XB is a quadratic form, so it does make some intuitive sense, but I need more than that...

Thanks for reading
LR
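
A quick numerical check of the normal equations above (a minimal sketch with numpy; the sizes and coefficients are made up):

import numpy as np

# Simulate y = XB + e with a leading column of ones in X
rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
B_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ B_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X B = X'y directly ...
B_normal = np.linalg.solve(X.T @ X, X.T @ y)

# ... and compare with numpy's least-squares solver
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(B_normal, B_lstsq))  # True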
 
Longrob, I do know the more geometric, linear-algebraic derivation using orthogonal projection, if you're interested. You're working in a normed space (V, ||.||).

The idea is that you're given an inconsistent linear system Ax = b; while there is no actual solution x, there is a second best, which is to find the vector x_ that best approximates a solution, i.e., to find x_ such that, for any x':

||Ax' - b|| >= ||Ax_ - b||

And Ax_ is actually the orthogonal projection of b onto the column space of A (if b were in the column space, Ax = b would have an exact solution).

With this layout, I know how to show that the line y = A-hat x + B-hat is the closest line to the collection of points in the least-squares sense (it is a sort of inverse problem, in that we are looking for the subspace -- the least-squares line/subspace -- instead of having the space given as Ax).

Anyway, let me know if you're interested in this type of derivation. I got a bit confused with the justification for differentiating matrices as if they were standard variables, by the way; do you know if this is some sort of Borel functional calculus or something similar to that?
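Here is that projection picture as a small numpy sketch (the names and sizes are just illustrative): the least-squares x_ makes Ax_ equal to the orthogonal projection of b onto the column space of A.

import numpy as np

# An inconsistent system: A is tall, so Ax = b generally has no exact solution
rng = np.random.default_rng(2)
A = rng.normal(size=(6, 2))
b = rng.normal(size=6)

# Least-squares solution x_
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Orthogonal projector onto the column space of A (A has full column rank here)
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A @ x_ls, P @ b))  # True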
 
Hi again

I need to stick with the pure linear algebraic derivation at the moment, but thanks anyway. I may come back to you later on that, as I am also interested in the geometric interpretation.

Anyway, I think I have solved it. Basically, it revolves around the "rule" that the derivative of b'Ab with respect to b is (A+A')b, or in the case at hand, where A is symmetric, 2Ab. This can be proved as follows.

A is an n x n symmetric matrix and b is an n x 1 column vector.

$$\mathbf{b}'\mathbf{A}\mathbf{b}=\sum_{i=1}^{n}\sum_{j=1}^{n}b_{i}a_{ij}b_{j}=\sum_{i=1}^{n}b_{i}^{2}a_{ii}+\sum_{i\neq j}b_{i}a_{ij}b_{j}$$

[sigh]
I can't get LaTeX to display properly, as you can see above. I created it in LyX and have tested here http://www.codecogs.com/latex/eqneditor.php that it displays correctly. I can't post the rest of the proof in LaTeX as it just displays the LaTeX code:

$$\frac{\partial(\mathbf{b}'\mathbf{A}\mathbf{b})}{\partial b_{k}}=2b_{k}a_{kk}+2\sum_{i\neq k}a_{ki}b_{i}$$

What am I doing wrong with this LaTeX?

The rest of the proof is to form the partial derivative with respect to a general element b_k and to notice that this is just 2 times the kth element of the vector Ab, so the derivative with respect to the whole vector b is 2Ab.
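
A numerical check of that rule (a minimal numpy sketch; the matrix is made up and deliberately non-symmetric, to show why the general answer is (A+A')b rather than 2Ab):

import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))   # deliberately NOT symmetric
b = rng.normal(size=n)

# Scalar function f(b) = b'Ab and its central finite-difference gradient
f = lambda v: v @ A @ v
eps = 1e-6
I = np.eye(n)
grad_fd = np.array([(f(b + eps * I[i]) - f(b - eps * I[i])) / (2 * eps) for i in range(n)])

print(np.allclose(grad_fd, (A + A.T) @ b))   # True: the gradient is (A + A')b
print(np.allclose(grad_fd, 2 * A @ b))       # False here, but True when A is symmetric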
 
OK, here goes again. I think I fixed the LaTeX problems:

Let $\mathbf{A}=\mathbf{X}^{\mathrm{T}}\mathbf{X}$, where $\mathbf{A}$ is a square symmetric matrix with elements $a_{ij}$ (in the regression setup above it is $(k+1)\times(k+1)$; I will just write its dimension as $n$ below), and $\boldsymbol{\beta}$ is the corresponding $n\times 1$ column vector. Expanding out,

$$\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta}=\sum_{i=1}^{n}\sum_{j=1}^{n}\beta_{i}a_{ij}\beta_{j}=\sum_{i=1}^{n}\beta_{i}^{2}a_{ii}+\sum_{i\neq j}\beta_{i}a_{ij}\beta_{j}$$

Then form the partial derivative of $\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta}$ with respect to $\beta_{k}$, an arbitrary element of $\boldsymbol{\beta}$:

$$\frac{\partial(\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta})}{\partial\beta_{k}}=2\beta_{k}a_{kk}+2\sum_{i\neq k}a_{ki}\beta_{i}=2\sum_{i=1}^{n}a_{ki}\beta_{i}$$

But $\sum_{i=1}^{n}a_{ki}\beta_{i}$ is the $k$th element of the vector $\mathbf{A}\boldsymbol{\beta}$, so it follows that

$$\frac{d(\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta})}{d\boldsymbol{\beta}}=2\mathbf{A}\boldsymbol{\beta}$$

Thus

$$\frac{d(\boldsymbol{\beta}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\boldsymbol{\beta})}{d\boldsymbol{\beta}}=2\mathbf{X}^{\mathrm{T}}\mathbf{X}\boldsymbol{\beta}$$

If anyone can check/correct I would be grateful.
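
As one way to check it numerically (a small sketch with numpy; X and beta here are random and purely illustrative), the finite-difference gradient of beta'X'X beta should match 2X'X beta:

import numpy as np

rng = np.random.default_rng(4)
n, m = 10, 3
X = rng.normal(size=(n, m))
beta = rng.normal(size=m)

# f(beta) = beta' X'X beta, written as the squared norm of X beta
f = lambda v: (X @ v) @ (X @ v)
eps = 1e-6
I = np.eye(m)
grad_fd = np.array([(f(beta + eps * I[i]) - f(beta - eps * I[i])) / (2 * eps) for i in range(m)])

print(np.allclose(grad_fd, 2 * X.T @ X @ beta))  # True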
 
