Newton-Raphson in Least Squares: How is it used? Cost Function?

  • #1
WWGD
TL;DR Summary
I just went over analysis of a data set that was analyzed using Linear Regression (OLS, I believe) and I saw Newton's method was used.
I just went over analysis of a data set that was analyzed using Linear Regression (OLS, I believe) and I saw Newton's method was used. Just curious: how is it used? I assume it is used to minimize the cost function, but this function was not made explicit. Does anyone know?
Thanks.
 
  • #2
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
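Written out explicitly (a standard statement of the OLS objective, using the notation that appears later in the thread, with ##\mathbf b## the vector of fitted coefficients and ##\hat y_i(\mathbf b)## the fitted values):
$$E(\mathbf b) = \sum_{i=1}^{n} \left( y_i - \hat y_i(\mathbf b) \right)^2$$
and the OLS estimate is the ##\mathbf b## that makes ##E## as small as possible.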
 
  • #3
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
 
  • #4
RPinPA said:
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
I think it may have been used with squared residuals, as Dale mentioned.
 
  • #5
Dale said:
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
Thanks. So, to verify: we use Newton's method to minimize the sum of squared residuals?
 
  • #6
WWGD said:
Thanks. So, to verify: we use Newton's method to minimize the sum of squared residuals?
Yes, or you can use any other optimization method, or a linear solver, which is faster. They should all give the same answer to within numerical precision.
 
  • #7
WWGD said:
I think it may have been used with squared residuals, as Dale mentioned.

Yes, that's the definition of least-squares. You write the expression for the sum of the squared residuals in terms of the data values and the unknown parameters. You take the gradient of that sum of squared residuals with respect to the parameters. And you set that gradient equal to 0.

That leads to a linear system, which can be solved in a single step with a linear solver. No need for an iterative method.
 
  • #8
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function E = ##\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1##) and then define an ##n \times p## matrix, usually called the design matrix:
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
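As an illustration of that recipe, here is a minimal NumPy sketch (the function names, basis functions, and data are made up for the example; in practice a routine like np.linalg.lstsq or a QR factorization is preferred to forming ##\mathbf{A^T A}## explicitly, since it is better conditioned):

```python
import numpy as np

# Minimal sketch of general linear least squares via the normal equations.
# The basis list below gives phi_1(x) = 1 and phi_2(x) = x (simple linear
# regression); any other functions of x can be swapped in for polynomial or
# more general linear-in-the-parameters fits.

def design_matrix(x, basis):
    """Build the n x p matrix A with A[i, j] = basis[j](x[i])."""
    return np.column_stack([phi(x) for phi in basis])

def linear_least_squares(x, y, basis):
    """Return b solving (A^T A) b = A^T y for the given basis functions."""
    A = design_matrix(x, basis)
    # np.linalg.lstsq(A, y, rcond=None) is better conditioned in practice,
    # but solving the normal equations mirrors the derivation above.
    return np.linalg.solve(A.T @ A, A.T @ y)

# Synthetic example: y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)
basis = [lambda t: np.ones_like(t), lambda t: t]
b = linear_least_squares(x, y, basis)
print(b)  # roughly [2, 3]
```

With the two basis functions shown this reproduces the usual intercept/slope fit; fitting a polynomial or any other linear combination of functions of ##x## changes only the list passed to design_matrix.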
 
  • #9
RPinPA said:
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function E = ##\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1##) and then define an ##n \times p## matrix, usually called the design matrix:
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
I understand. But I am using ML and not standard stats. I am not clear on just how ML regression algorithms use iteration, but they do.
 
  • #10
Which Matlab function are you talking about exactly? It still sounds like you're using a nonlinear solver.

The equations I wrote above work out because E is quadratic in the parameters, and therefore its gradient is linear in the parameters, so you get a linear system to solve. I've implemented precisely that method in Matlab many times, since it has a linear solver (the backslash operator) built in. So all you have to do for your parameter estimate is b = (A'*A) \ (A' * y);

In general nonlinear least squares you're still trying to find the zero of the gradient, but it's no longer linear in the parameters. Thus you're trying to find roots of a general nonlinear function. And that's what Newton-Raphson does.

In one dimension, in its simplest form, the iteration to find a zero of ##f(x)## is simply ##x_{i+1} = x_i - f(x_i)/f'(x_i)##. Since our ##f## is the gradient of ##E## with respect to the parameter vector ##\mathbf b##, its derivative generalizes to the Hessian of ##E##, the matrix of second derivatives. Newton-Raphson then becomes, in its more general vectorized form:
$$\mathbf b_{i+1} = \mathbf b_i - \alpha \mathbf H^{-1}(E) \nabla E$$
where in general there's an adaptive stepsize parameter ##\alpha## which grows and shrinks in response to the local properties of the function.

You can apply this to a quadratic function such as arises in linear least squares, but then it usually converges in a single step. The update rule with ##\alpha = 1## leads you to the zero of the gradient in one step.
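To make that last point concrete, here is a small sketch (illustrative names and synthetic data, not anyone's production code) showing that a single Newton step on the quadratic least-squares objective lands exactly on the normal-equation solution:

```python
import numpy as np

def newton_step_ols(A, y, b0):
    """One Newton-Raphson step on E(b) = ||y - A b||^2.

    grad E = 2 (A^T A b - A^T y) and the Hessian is 2 A^T A, so with
    step size alpha = 1 a single step lands on the normal-equation solution.
    """
    grad = 2.0 * (A.T @ A @ b0 - A.T @ y)
    hess = 2.0 * (A.T @ A)
    return b0 - np.linalg.solve(hess, grad)

# Illustrative straight-line data (intercept + slope).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)
A = np.column_stack([np.ones_like(x), x])

b0 = np.zeros(2)                                # arbitrary starting guess
b1 = newton_step_ols(A, y, b0)                  # one Newton step
b_exact = np.linalg.solve(A.T @ A, A.T @ y)     # normal-equation solution
print(np.allclose(b1, b_exact))                 # True: one step suffices
```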
 
  • #11
We're using Matplotlib and Graphlab. I realized the iterations are being done over all partitions of the data points into test data and training data, and we stop when the loss function is below a threshold. But I still don't see why we iterate. I think we are tweaking the coefficients to reduce MSE, but not sure how.
 
  • #12
I haven't done the analysis but I suspect that any or all of the following may be true
  1. the linear solver has an ##O(n^2)## memory requirement whereas the iterative method is ##O(n)## and is therefore more scalable
  2. the linear solver is only applicable to linear regression whereas an iterative method can be applied to any curve and is therefore more flexible
  3. Newton-Raphson is very efficient for quadratic problems, so there is little if any computational penalty versus forming and solving the linear system (and for a non-linear fit it can easily be extended to higher-order Householder methods if desired)
 
  • #13
WWGD said:
I think we are tweaking the coefficients to reduce MSE, but not sure how.
Whichever method you use to solve for the coefficients, the solution minimizes the Mean Squared Error (that is the definition of Ordinary Least Squares regression, as RPinPA pointed out in #7), so I'm not sure either. You could be tweaking them to achieve something else? Or perhaps this is NOT Ordinary Least Squares; you don't seem sure in your original post.
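For what it's worth, here is a generic sketch of what iteratively "tweaking the coefficients to reduce MSE" usually looks like: plain batch gradient descent on the OLS loss. This is only an illustration with made-up data and a made-up stopping rule, not how Graphlab (or any particular library) is actually implemented, but for OLS it approaches the same answer as the direct linear solve:

```python
import numpy as np

def ols_gradient_descent(A, y, lr=0.1, tol=1e-10, max_iter=100_000):
    """Generic sketch: nudge the coefficients down the MSE gradient until the
    loss stops improving.  This is NOT any particular library's algorithm; it
    only illustrates iteratively 'tweaking the coefficients to reduce MSE'."""
    n, p = A.shape
    b = np.zeros(p)
    prev_mse = np.inf
    for _ in range(max_iter):
        residual = A @ b - y
        mse = residual @ residual / n
        if prev_mse - mse < tol:                 # improvement below threshold
            break
        b -= lr * (2.0 / n) * (A.T @ residual)   # gradient of the MSE
        prev_mse = mse
    return b

# Same kind of illustrative data as in the earlier sketches.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)
A = np.column_stack([np.ones_like(x), x])
print(ols_gradient_descent(A, y))            # iterative answer
print(np.linalg.solve(A.T @ A, A.T @ y))     # direct answer, essentially equal
```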
 
  • #14
If there are polynomial terms of higher degree than squared, or you have interaction terms, then I think simple linear algebra techniques will not work. A numerical (iterative) method may converge to a solution, though.
 

1. What is the Newton-Raphson method?

The Newton-Raphson method is an iterative algorithm used to find the roots of a function. It involves using the derivative of the function to approximate the location of the root and then repeating the process until a desired level of accuracy is achieved.
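In its standard one-dimensional form, starting from an initial guess ##x_0##, the iteration for a root of ##f## is
$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)},$$
repeated until successive iterates agree to the desired tolerance.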

2. How is the Newton-Raphson method used in least squares?

In least squares, the Newton-Raphson method can be used to find the minimum of the cost function, which measures the difference between the model's predicted values and the actual data. Minimizing the cost function yields the estimated parameters of the model. For linear least squares the minimum can also be found directly by solving a linear system (the normal equations), so Newton-Raphson is mainly needed for nonlinear least squares.

3. What is the cost function in the Newton-Raphson method?

The cost function in the Newton-Raphson method is a measure of the difference between the predicted values of a model and the actual values of the data. It is typically represented as the sum of squared errors and is minimized to find the best fit parameters for the model.

4. How does the Newton-Raphson method improve upon the least squares method?

The Newton-Raphson method does not replace least squares; it is one way to carry out the minimization of the least-squares cost function. By using the first and second derivatives of the cost function to locate the minimum, rather than searching by trial and error, it can estimate the model parameters efficiently, which is especially useful for nonlinear models that have no closed-form solution.

5. What are the limitations of using Newton-Raphson in least squares?

One limitation of using Newton-Raphson in least squares is that it requires the cost function to be differentiable, which may not always be the case. Additionally, the method may converge to a local minimum rather than the global minimum, resulting in a suboptimal solution.
