Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Please convince me (OLS matrix derivation)

  1. Oct 1, 2011 #1
    Hi all

    In the derivation of the normal equations for Ordinary Least Squares estimates we have B (m x 1 column vector) and X (n x m matrix). Could someone please convince me that the derivative with respect to B of

    Thanks !
  2. jcsd
  3. Oct 1, 2011 #2
    I'm sorry, I'm not familiar with the OLS you're referring to. The one I'm familiar is
    one used either to find the best approximation (using orthogonal projection), to an
    inconsistent system, or to find the line of best fit to a set of data points, i.e., the
    least-squares line; the line that minimizes the sum of square residuals, and I don't
    recognize your layout, nor the meaning of the derivative B'. Would you expand on your
  4. Oct 1, 2011 #3
    Hi and thanks for your reply.
    Could you take a look here:
    On page 18/19 you see exactly what (I think) you are referring to in terms of the orthogonal projection. What I am referring to is on the bottom of page 19:
    "Differentiating with respect to beta and setting to zero, we find that beta-hat satisfies"
    It's this differentiation I am specifically talking about (of the 2nd term o the RHS)

    Thanks again
  5. Oct 1, 2011 #4
    Excuse me:
    I meant "the 3rd term in the expression above" not "the 2nd term on the RHS".
  6. Oct 1, 2011 #5
    Sorry, longrob, I have not had a chance to look at it, and I'm going to be busy for a while.

    Still, from what I saw, it seems like the chain rule should be involved, since d/dB[(B'X'XB)]=

    d/dB[((XB)'XB)] , and this is a product together with a chain rule; we have g(B)=(XB)'

    and f(B)=XB (where X is a contant re B ), so that g(B) is a composition of scaling and

    transposition. I'll look into it if I have a chance.
  7. Oct 1, 2011 #6
    I'm sorry, I have been a bit confused about this myself; the best I have been able
    to come up with, has been this; maybe/hopefully you or someone else can finish this
    and/or make it more precise:

    In B'X'XB , X'X is a square nxn matrix, and, basically, B' is the same as B, only
    expressed in a different format , so what you get is a quadratic form B'MB , or,
    another approach is that d/dB(B'MB)= d/dB([BM]B]= MB+BM , and somehow
    (despite the product not being commutative), this is equal to 2MB=2XX'B .

    Sorry, I have not been able to figure it out better.
  8. Oct 2, 2011 #7
    Hi Bacle, thanks for your messages. I'm glad I'm not the only one who is a bit confused by it.

    For completeness and the benefit of others, I'll explain the setup so that it's not necessary to refer to the link I posted.

    We have
    y = XB + e

    where y is a n x 1 column vector of responses
    X is a n x (k+1) matrix with a leading column of 1s for the regression intercept and the remaining columns being nx1 vectors of the observed values of the explanatory variables.
    B is a (k+1) x 1 column vector of regression coefficients.
    e is a n x 1 column vector of errors

    We wish to find the values for the elements of B which minimise the sum of the squares errors.
    That is, we wish to minimise e'e
    e'e = (y - XB)' (y - XB)

    = (y'- B'X') (y - XB)

    = y'y - y'XB - B'X'y + B'X'XB

    Now, since y'XB and B'X'y are 1 x 1 , they are equal, so:

    = y'y - 2y'XB + B'X'XB

    Differentiating this with respect to B and setting to zero yields

    X'XB = X'y

    which are the so-calleed "normal equations" which upon rearrangement gives the least squares estimator for B (provided that X'X is invertible)

    It is the last step above which I don't get - that the derivative of B'X'XB with respect to B is 2X'XB. Obviously B'X'XB is a quadratic form, so it does make some intuitive sense, but I need more than that.......

    Thanks for reading
    Last edited: Oct 2, 2011
  9. Oct 2, 2011 #8
    Longrob, I do know the more geometric, and linear-algebraic derivation, using
    orthogonal projection, if you're interested. You're working in a normed space (V,||.||).

    The idea is that you're given an inconsistent linear system Ax=b; while there is no
    actual solution x , there is a second best, which is finding the vector x_
    that best approximates the solution, i.e., to find x_ such that, for any x':


    And this x_ is actually the projection of b into the row space Ax
    (if b were in the row space, Ax=b would have a solution).

    With this layout, I know how to show that the line y=Ax^+B^
    is the closest line to the collection of points in the LS sense (it is a sort-of inverse
    problem , in that we are looking for the subspace--the least squares line/subspace
    -- instead of having the space given as Ax).

    Anyway, let me know if you're interested in this type of derivation. I got a bit
    confused with the justification for differentiating matrices as if they were standard
    variables, BTW; do you know if this is some sort of Borel Functional Calculus or
    something similar to that?.
  10. Oct 2, 2011 #9
    Hi again

    I need to stick with the pure linear algebraic derivation at the moment, but thanks anyway. I may come back to you later on that, as I am also interested in the geometric interpretation.

    Anyway, I think I have solved it. Basically, it revolves around the "rule" that the derivative of b'Ab with respect to b is (A+A')b, or in the case at hand, where A is symmetric, 2Ab. This can be proved as follows.

    A is an n x n symmetric matrix. b is a n x 1 column vector.


    [itex]=\overset{n}{\underset{i=1}{\sum}}b_{i}^{2}a_{ii}+\overset{n}{\underset{i\neq j}{\sum}}\overset{n}{\underset{j\neq i}{\sum}}b_{i}a_{ij}b_{j}[/itex]

    I can't get LaTex to display properly, as you can see above. I created it in Lyx and have tested it here http://www.codecogs.com/latex/eqneditor.php that it displays correctly. I can't post the rest of the proof in Latex as it just displays the Latex code:
    [itex]\frac{\partial(\mathbf{\mathbf{b}'}\mathbf{Ab})}{\partial b_{k}}&=2b_{k}a_{kk}+2\overset{n}{\underset{i\neq k}{\sum}}a_{ki}b_{i}[/itex]
    What am I doing wrong with this Latex ?

    The rest of the proof is to form the partial derivative with respect to a general b sub k and to notice that this is just 2 times the elements of the vector Ab, so the derivative with respect to the whole vector b is 2 times the vector Ab.
  11. Oct 3, 2011 #10
    OK, here goes again. I think I fixed the LaTex problems:
    let [itex]\mathbf{A=X^{\textrm{T}}X}[/itex] where [itex]\mathbf{A}[/itex] is a [itex]n\times n[/itex] square symmetric matrix with elements [itex]a_{ij}[/itex]. [itex]\mathbf{\mathbf{\boldsymbol{\beta}}} [/itex] is the [itex]n\times 1[/itex] column vector. Expanding out.

    [itex]\begin{align}\mathbf{\boldsymbol{\beta}^{\textrm{T}}A\boldsymbol{\beta}}&=\overset{n}{\underset{i=1}{\sum}}\overset{n}{\underset {j=1}{\sum}}\beta_{i}a_{ij}\beta_{j}

    [itex]\begin{align}=\overset{n}{\underset{i=1}{\sum}} \beta_{i}^{2}a_{ii}+\overset{n}{\underset{i\neq j}{\sum}}\overset{n}{\underset{j\neq i} {\sum}}\beta_{i}a_{ij}\beta_{j}\end{align}[/itex]

    Then, form the partial derivative of [itex]\mathbf{\mathbf{\mathbf{\boldsymbol{\beta}}^ {\textrm{T}}}\mathbf{A\boldsymbol{\beta}}}[/itex] with respect to [itex]\beta_{k}[/itex], an arbitrary element of [itex]\mathbf{\mathbf{\boldsymbol{\beta}}} [/itex]

    [itex]\begin{align}\frac{\partial(\mathbf{\mathbf{\beta}^{\textrm{T}}}\mathbf{A\boldsymbol{\beta}})} {\partial\beta_{k}}&=2\beta_{k}a_{kk}+2 \overset{n}{\underset{i\neq k}{\sum}}a_{ki}\beta_{i}\end{align}[/itex]


    But [itex]\overset{n}{\underset{i=1}{\sum}}a_{ki}\beta_{i}[/itex] is the [itex]k[/itex]th element of the vector [itex]\mathbf{A}\boldsymbol{\beta}[/itex] so it follows that

    [itex]\begin{align}\frac{d( \mathbf{\mathbf{\beta}^{\textrm{T}}} \mathbf{A\boldsymbol{\beta}})} {d\mathbf{\boldsymbol{\beta}}}&=2 \mathbf{A\boldsymbol{\beta}}\end{align}[/itex]


    [itex]\begin{align}\frac{d(\mathbf{\beta^{\textrm{T}} X^{\textrm{T}}X\boldsymbol{\beta}})} {d\mathbf{\boldsymbol{\beta}}}= 2\mathbf{ {\mathbf{ X^{\textrm{T}}X\boldsymbol{\beta}}}} \end{align}[/itex]

    If anyone can check/correct I would be grateful.
Share this great discussion with others via Reddit, Google+, Twitter, or Facebook