Please convince me (OLS matrix derivation)

  • Context: Graduate
  • Thread starter: longrob
  • Tags: Derivation, Matrix
SUMMARY

The discussion centers on the derivation of the normal equations for Ordinary Least Squares (OLS) estimates, specifically the differentiation of the expression B'X'XB with respect to B. The participants confirm that the derivative is indeed 2X'XB, based on the properties of quadratic forms and the chain rule in matrix calculus. The conversation also touches on the geometric interpretation of OLS and the importance of understanding matrix differentiation rules.

PREREQUISITES
  • Understanding of Ordinary Least Squares (OLS) regression
  • Familiarity with matrix calculus and differentiation
  • Knowledge of quadratic forms in linear algebra
  • Basic concepts of linear projections in vector spaces
NEXT STEPS
  • Study the derivation of the normal equations in OLS regression
  • Learn about matrix differentiation rules, particularly for quadratic forms
  • Explore the geometric interpretation of OLS using orthogonal projections
  • Investigate the application of the chain rule in matrix calculus
USEFUL FOR

Statisticians, data scientists, and anyone involved in regression analysis or linear algebra who seeks to deepen their understanding of OLS and matrix differentiation techniques.

longrob
Hi all

In the derivation of the normal equations for Ordinary Least Squares estimates we have B (m x 1 column vector) and X (n x m matrix). Could someone please convince me that the derivative with respect to B of
B'X'XB
is
2X'XB

Thanks !
LR
 
I'm sorry, I'm not familiar with the OLS you're referring to. The one I'm familiar with is used either to find the best approximation (using orthogonal projection) to an inconsistent system, or to find the line of best fit to a set of data points, i.e., the least-squares line: the line that minimizes the sum of squared residuals. I don't recognize your layout, nor the meaning of B' in the derivative. Would you expand on your layout?
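For concreteness, this is the least-squares line I mean, as a small numerical sketch (Python with numpy; the data are made up purely for illustration):

import numpy as np

# Made-up data roughly following y = 2 + 0.5*x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Design matrix: a column of ones (intercept) and the x values
X = np.column_stack([np.ones_like(x), x])

# Coefficients of the line minimizing the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [2.0, 0.5]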
 
Hi and thanks for your reply.
Could you take a look here:
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
On page 18/19 you see exactly what (I think) you are referring to in terms of the orthogonal projection. What I am referring to is on the bottom of page 19:
"Differentiating with respect to beta and setting to zero, we find that beta-hat satisfies"
It's this differentiation I am specifically talking about (of the 2nd term on the RHS).

Thanks again
LR
 
Excuse me:
I meant "the 3rd term in the expression above" not "the 2nd term on the RHS".
 
Sorry, longrob, I have not had a chance to look at it, and I'm going to be busy for a while.

Still, from what I saw, it seems like the chain rule should be involved, since d/dB[B'X'XB] = d/dB[(XB)'(XB)], and this is a product together with a chain rule; we have g(B) = (XB)' and f(B) = XB (where X is a constant with respect to B), so that g(B) is a composition of scaling and transposition. I'll look into it if I have a chance.
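For what it's worth, here is a sketch of that product/chain-rule idea using differentials, writing u = XB so that the scalar is u'u and du = X dB (X treated as constant):

$$d(\mathbf{B}'\mathbf{X}'\mathbf{X}\mathbf{B}) = d\big((\mathbf{XB})'(\mathbf{XB})\big) = (\mathbf{X}\,d\mathbf{B})'(\mathbf{XB}) + (\mathbf{XB})'(\mathbf{X}\,d\mathbf{B}) = 2\,\mathbf{B}'\mathbf{X}'\mathbf{X}\,d\mathbf{B}$$

since each of the two middle terms is a scalar and hence equal to its own transpose. Reading off the coefficient of dB (as a column vector) gives the gradient 2X'XB.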
 
I'm sorry, I have been a bit confused about this myself; the best I have been able to come up with has been this; maybe/hopefully you or someone else can finish it and/or make it more precise:

In B'X'XB, X'X is a square m x m matrix, and B' is just B transposed, so what you get is a quadratic form B'MB with M = X'X. Another approach is to differentiate d/dB(B'MB) formally as a product, which gives something like M'B + MB, and since M = X'X is symmetric this comes out as 2MB = 2X'XB, though I can't fully justify that formal step.

Sorry, I have not been able to figure it out better.
 
Hi Bacle, thanks for your messages. I'm glad I'm not the only one who is a bit confused by it.

For completeness and the benefit of others, I'll explain the setup so that it's not necessary to refer to the link I posted.

We have
y = XB + e

where y is an n x 1 column vector of responses,
X is an n x (k+1) matrix with a leading column of 1s for the regression intercept, the remaining columns being the n x 1 vectors of observed values of the explanatory variables,
B is a (k+1) x 1 column vector of regression coefficients, and
e is an n x 1 column vector of errors.

We wish to find the values of the elements of B which minimise the sum of squared errors.
That is, we wish to minimise e'e
Now,
e'e = (y - XB)' (y - XB)

= (y'- B'X') (y - XB)

= y'y - y'XB - B'X'y + B'X'XB

Now, since y'XB and B'X'y are 1 x 1 , they are equal, so:

= y'y - 2y'XB + B'X'XB

Differentiating this with respect to B, term by term (y'y gives 0, -2y'XB gives -2X'y, and B'X'XB gives 2X'XB), and setting to zero yields

X'XB = X'y

which are the so-called "normal equations", which upon rearrangement give the least squares estimator for B (provided that X'X is invertible).

It is the last step above which I don't get - that the derivative of B'X'XB with respect to B is 2X'XB. Obviously B'X'XB is a quadratic form, so it does make some intuitive sense, but I need more than that...

Thanks for reading
LR
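
A quick numerical check of the normal equations above (a minimal sketch with numpy; the sizes and coefficients are made up):

import numpy as np

# Simulate y = XB + e with a leading column of ones in X
rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
B_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ B_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X B = X'y directly ...
B_normal = np.linalg.solve(X.T @ X, X.T @ y)

# ... and compare with numpy's least-squares solver
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(B_normal, B_lstsq))  # True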
 
Longrob, I do know the more geometric, linear-algebraic derivation using orthogonal projection, if you're interested. You're working in a normed space (V, ||.||).

The idea is that you're given an inconsistent linear system Ax = b; while there is no actual solution x, there is a second best, which is to find the vector x_ that best approximates a solution, i.e., to find x_ such that, for any x':

||Ax' - b|| >= ||Ax_ - b||

And Ax_ is actually the orthogonal projection of b onto the column space of A (if b were in the column space, Ax = b would have an exact solution).

With this layout, I know how to show that the line y = A-hat x + B-hat is the closest line to the collection of points in the least-squares sense (it is a sort of inverse problem, in that we are looking for the subspace -- the least-squares line/subspace -- instead of having the space given as Ax).

Anyway, let me know if you're interested in this type of derivation. I got a bit confused with the justification for differentiating matrices as if they were standard variables, by the way; do you know if this is some sort of Borel functional calculus or something similar to that?
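Here is that projection picture as a small numpy sketch (the names and sizes are just illustrative): the least-squares x_ makes Ax_ equal to the orthogonal projection of b onto the column space of A.

import numpy as np

# An inconsistent system: A is tall, so Ax = b generally has no exact solution
rng = np.random.default_rng(2)
A = rng.normal(size=(6, 2))
b = rng.normal(size=6)

# Least-squares solution x_
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Orthogonal projector onto the column space of A (A has full column rank here)
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A @ x_ls, P @ b))  # True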
 
Hi again

I need to stick with the pure linear algebraic derivation at the moment, but thanks anyway. I may come back to you later on that, as I am also interested in the geometric interpretation.

Anyway, I think I have solved it. Basically, it revolves around the "rule" that the derivative of b'Ab with respect to b is (A+A')b, or in the case at hand, where A is symmetric, 2Ab. This can be proved as follows.

A is an n x n symmetric matrix and b is an n x 1 column vector.

$$\mathbf{b}'\mathbf{A}\mathbf{b}=\sum_{i=1}^{n}\sum_{j=1}^{n}b_{i}a_{ij}b_{j}=\sum_{i=1}^{n}b_{i}^{2}a_{ii}+\sum_{i\neq j}b_{i}a_{ij}b_{j}$$

[sigh]
I can't get LaTeX to display properly, as you can see above. I created it in LyX and have tested here http://www.codecogs.com/latex/eqneditor.php that it displays correctly. I can't post the rest of the proof in LaTeX as it just displays the LaTeX code:

$$\frac{\partial(\mathbf{b}'\mathbf{A}\mathbf{b})}{\partial b_{k}}=2b_{k}a_{kk}+2\sum_{i\neq k}a_{ki}b_{i}$$

What am I doing wrong with this LaTeX?

The rest of the proof is to form the partial derivative with respect to a general element b_k and to notice that this is just 2 times the kth element of the vector Ab, so the derivative with respect to the whole vector b is 2Ab.
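
A numerical check of that rule (a minimal numpy sketch; the matrix is made up and deliberately non-symmetric, to show why the general answer is (A+A')b rather than 2Ab):

import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))   # deliberately NOT symmetric
b = rng.normal(size=n)

# Scalar function f(b) = b'Ab and its central finite-difference gradient
f = lambda v: v @ A @ v
eps = 1e-6
I = np.eye(n)
grad_fd = np.array([(f(b + eps * I[i]) - f(b - eps * I[i])) / (2 * eps) for i in range(n)])

print(np.allclose(grad_fd, (A + A.T) @ b))   # True: the gradient is (A + A')b
print(np.allclose(grad_fd, 2 * A @ b))       # False here, but True when A is symmetric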
 
OK, here goes again. I think I fixed the LaTeX problems:

Let $\mathbf{A}=\mathbf{X}^{\mathrm{T}}\mathbf{X}$, where $\mathbf{A}$ is a square symmetric matrix with elements $a_{ij}$ (in the regression setup above it is $(k+1)\times(k+1)$; I will just write its dimension as $n$ below), and $\boldsymbol{\beta}$ is the corresponding $n\times 1$ column vector. Expanding out,

$$\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta}=\sum_{i=1}^{n}\sum_{j=1}^{n}\beta_{i}a_{ij}\beta_{j}=\sum_{i=1}^{n}\beta_{i}^{2}a_{ii}+\sum_{i\neq j}\beta_{i}a_{ij}\beta_{j}$$

Then form the partial derivative of $\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta}$ with respect to $\beta_{k}$, an arbitrary element of $\boldsymbol{\beta}$:

$$\frac{\partial(\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta})}{\partial\beta_{k}}=2\beta_{k}a_{kk}+2\sum_{i\neq k}a_{ki}\beta_{i}=2\sum_{i=1}^{n}a_{ki}\beta_{i}$$

But $\sum_{i=1}^{n}a_{ki}\beta_{i}$ is the $k$th element of the vector $\mathbf{A}\boldsymbol{\beta}$, so it follows that

$$\frac{d(\boldsymbol{\beta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\beta})}{d\boldsymbol{\beta}}=2\mathbf{A}\boldsymbol{\beta}$$

Thus

$$\frac{d(\boldsymbol{\beta}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\boldsymbol{\beta})}{d\boldsymbol{\beta}}=2\mathbf{X}^{\mathrm{T}}\mathbf{X}\boldsymbol{\beta}$$

If anyone can check/correct I would be grateful.
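
As one way to check it numerically (a small sketch with numpy; X and beta here are random and purely illustrative), the finite-difference gradient of beta'X'X beta should match 2X'X beta:

import numpy as np

rng = np.random.default_rng(4)
n, m = 10, 3
X = rng.normal(size=(n, m))
beta = rng.normal(size=m)

# f(beta) = beta' X'X beta, written as the squared norm of X beta
f = lambda v: (X @ v) @ (X @ v)
eps = 1e-6
I = np.eye(m)
grad_fd = np.array([(f(beta + eps * I[i]) - f(beta - eps * I[i])) / (2 * eps) for i in range(m)])

print(np.allclose(grad_fd, 2 * X.T @ X @ beta))  # True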
 
